How can you filter a Django query's joined tables then iterate the joined tables in one query?

I have a table Parent, and a table Child with a foreign key to Parent.
I want to run a query for all Parents that have a child called Eric, and report Eric's age.
I run:
parents = Parent.objects.filter(child__name='Eric')
I then iterate over the queryset:
for parent in parents:
    print(f'Parent name {parent.name} child Eric age {parent.child.age}')
Clearly this doesn't work - I need to access children through the reverse foreign key manager, so I try:
for parent in parents:
    print(f'Parent name {parent.name}')
    for child in parent.child_set.all():
        print(f'Child Eric age {child.age}')
Django returns all children's ages, not just children named Eric.
I can repeat the filter conditions:
parents = Parent.objects.filter(child__name='Eric')
for parent in parents:
    print(f'Parent name {parent.name}')
    for child in parent.child_set.filter(name='Eric'):
        print(f'Child Eric age {child.age}')
But this means duplicate code (risking future inconsistency when another dev changes one copy but not the other), and the inner filter runs an extra query against the database for every parent.
Is there a way of getting the matching records and iterating over them? I've been Djangoing for years and can't believe I can't do this!
PS. I know that I can do Child.objects.filter(name='Eric').select_related('parent'). But what I would really like to do involves a second child table. So add to the above example a table Address with a foreign key to Parent. I want to get parents with children named Eric and addresses in Timbuktu, and iterate over all the Timbuktu addresses and all the little Erics. This is why I don't want to use Child's object manager.
This is the best I could come up with - three queries, repeating each filter.
from django.db.models import Prefetch

children = Child.objects.filter(name='Eric')
addresses = Address.objects.filter(town='Timbuktu')
parents = (
    Parent.objects
    .filter(child__name='Eric', address__town='Timbuktu')
    .prefetch_related(Prefetch('child_set', children))
    .prefetch_related(Prefetch('address_set', addresses))
)
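With the filtered Prefetch objects attached, the nested loops from before work as hoped - each related manager serves only the filtered rows from the prefetch cache, with no further queries (a sketch, using the models above):

for parent in parents:
    print(f'Parent name {parent.name}')
    for child in parent.child_set.all():  # only the Erics
        print(f'Child Eric age {child.age}')
    for address in parent.address_set.all():  # only the Timbuktu addresses
        print(f'Address town {address.town}')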

The .values function gives you direct access to the recordset returned (thank you @Iain Shelvington):
parents_queryset_dicts = (
    Parent.objects
    .filter(child__name='Eric', address__town='Timbuktu')
    .values('id', 'name', 'child__id', 'address__id', 'child__age', 'address__whatever')
    .order_by('id', 'child__id', 'address__id')
)
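Each row comes back as a flat dict, one per child/address combination; for a parent with two matching children and one matching address you'd get something like (illustrative values):

{'id': 1, 'name': 'Pat', 'child__id': 7, 'address__id': 3, 'child__age': 9, 'address__whatever': '...'}
{'id': 1, 'name': 'Pat', 'child__id': 8, 'address__id': 3, 'child__age': 4, 'address__whatever': '...'}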
Note though that this retrieves a Cartesian product of children and addresses, so the gain from the reduced query count is partly offset by the larger result set and the de-duplication work below. So I am starting to think two queries using Child.objects and Address.objects are superior - slightly slower, but simpler code.
In my actual use case I have multiple multi-table chains of foreign key joins, so I am splitting the query to prevent the Cartesian join, but still making use of the .values() approach to get filtered, nested tables.
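A sketch of that split, assuming the foreign key fields are named parent - one query per chain, each keyed by parent id, with no cross-chain join:

parent_ids = list(
    Parent.objects
    .filter(child__name='Eric', address__town='Timbuktu')
    .values_list('id', flat=True)
)
child_rows = (
    Child.objects
    .filter(parent_id__in=parent_ids, name='Eric')
    .values('parent_id', 'id', 'age')
)
address_rows = (
    Address.objects
    .filter(parent_id__in=parent_ids, town='Timbuktu')
    .values('parent_id', 'id', 'whatever')
)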
If you then want a hierarchical structure, eg for sending as JSON to the client, to produce:
parents = {
    parent_id: {
        'name': name,
        'children': {
            child_id: {'age': child_age},
        },
        'addresses': {
            address_id: {'whatever': address_whatever},
        },
    },
}
Run something like:
prev_parent_id = prev_child_id = prev_address_id = None
parents = {}
for parent in parents_queryset_dicts:
    if parent['id'] != prev_parent_id:
        parents[parent['id']] = {'name': parent['name'], 'children': {}, 'addresses': {}}
        prev_parent_id = parent['id']
    if parent['child__id'] != prev_child_id:
        parents[parent['id']]['children'][parent['child__id']] = {'age': parent['child__age']}
        prev_child_id = parent['child__id']
    if parent['address__id'] != prev_address_id:
        parents[parent['id']]['addresses'][parent['address__id']] = {'whatever': parent['address__whatever']}
        prev_address_id = parent['address__id']
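An equivalent grouping that leans on dict.setdefault drops the prev_* bookkeeping and makes the de-duplication explicit (a sketch over the same .values() rows):

parents = {}
for row in parents_queryset_dicts:
    entry = parents.setdefault(row['id'], {'name': row['name'], 'children': {}, 'addresses': {}})
    entry['children'].setdefault(row['child__id'], {'age': row['child__age']})
    entry['addresses'].setdefault(row['address__id'], {'whatever': row['address__whatever']})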
Either way this is dense code: you no longer get access to any fields not explicitly extracted and copied in (including any nested ~_set querysets), and the de-duplication of the Cartesian product is not obvious to later developers. You can grab the queryset, keep it, then extract the .values, so you have both from the same, single, database query. But often the three-query repeated-filter version is a bit cleaner, if a couple of database queries less efficient:
children = Child.objects.filter(name='Eric')
addresses = Address.objects.filter(town='Timbuktu')
parents_queryset = (
    Parent.objects
    .filter(child__name='Eric', address__town='Timbuktu')
    .prefetch_related(Prefetch('child_set', children))
    .prefetch_related(Prefetch('address_set', addresses))
)
parents = {}
for parent in parents_queryset:
    parents[parent.id] = {'name': parent.name, 'children': {}, 'addresses': {}}
    for child in parent.child_set.all():  # implicitly filtered by the Prefetch
        parents[parent.id]['children'][child.id] = {'age': child.age}
    for address in parent.address_set.all():  # also implicitly filtered
        parents[parent.id]['addresses'][address.id] = {'whatever': address.whatever}
One last approach, which someone briefly posted then deleted - I'd love to know why - is using annotate and F() objects. I have not experimented with this, but the SQL generated looks fine; it seems to run a single query and not require repeating the filters:
from django.db.models import F

parents = (
    Parent.objects.filter(child__name='Eric')
    .annotate(child_age=F('child__age'))
)
Pros and cons seem identical to .values() above, although .values() seems slightly more basic Django (so easier to read) and you don't have to duplicate field names (eg, the obfuscation above of child_age=F('child__age')). Advantages might be the convenience of . accessors instead of ['field'] lookups, keeping hold of the lazy nested recordsets, etc. - although if you're counting the queries, you probably want things to fall over if you issue an accidental query per row.
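With the annotation in place, iteration stays a single query, though each parent repeats once per matching child, just like the .values() rows (a sketch assuming the query above):

for parent in parents:
    print(f'Parent name {parent.name} child Eric age {parent.child_age}')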

Related

how to prefetch parents of a child on mptt tree with django prefetch_related?

Say I've got a product instance, linked to a 4th-level child category. If I only want the root category and the 4th-level child category, the query below is enough to fetch the data with minimal database queries:
Product.objects.filter(active=True).prefetch_related('category__root', 'category')
If I have to reach the parents of this product's category and use the get_ancestors() method for that, nearly three times as many database queries happen.
If I instead write a query like the one below, the number of database queries stays low.
Product.objects.filter(active=True).prefetch_related(
    'category__root',
    'category',
    'category__parent',
    'category__parent__parent',
    'category__parent__parent__parent',
    'category__parent__parent__parent__parent')
But this query is not effective when the depth of the tree is unknown.
Is there a way to prefetch parents dynamically in the query above?
Old question, but I'll try to give it a go.
This will require an extra query, though. (But that's better than the possible hundreds, if not more.)
Some explanation:
First: We will need to determine how many levels deep the categories are for the active products.
To avoid paying the extra query every time, you could run the following code once at startup and cache the result, if your categories are static.
from django.db import models

# product_set is the reverse lookup from Category to Product
max_level = (
    Category.objects.filter(product_set__active=True)
    .values('level')
    .aggregate(max_level=models.Max('level'))
)['max_level']
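One way to do that caching, sketched with Django's low-level cache API (the key name and timeout are made-up choices):

from django.core.cache import cache
from django.db import models

def get_max_level():
    max_level = cache.get('category_max_level')
    if max_level is None:
        max_level = Category.objects.filter(
            product_set__active=True
        ).aggregate(max_level=models.Max('level'))['max_level']
        cache.set('category_max_level', max_level, 60 * 60)  # recompute hourly
    return max_level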
Second: We will need to create the prefetch strings based on the levels. The maximum number of levels is equal to the maximum number of (nested) parents:
level 0 = no parents
level 1 = 1 parent
level 2 = 2 parents (nested)
level 3 = 3 parents (nested)
This means that we can simply loop over the range of levels, extending the lookup string with '__parent' each time and appending it to a list.
prefetch_string = 'category'
prefetch_list = []
for i in range(max_level):
    prefetch_string += '__parent'
    prefetch_list.append(prefetch_string)
Third: We pass in the prefetch_list, unpacking it into arguments.
Product.objects.filter(active=True).prefetch_related(
    'category__root',
    'category',
    *prefetch_list)  # unpack the list into args
We can then easily refactor this into a single dynamic function.
def get_mptt_prefetch(field_name, lookup_name='__parent', related_model_qs=None):
    max_level = related_model_qs.values('level').aggregate(
        max_level=models.Max('level')
    )['max_level']
    prefetch_list = []
    prefetch_string = field_name
    for i in range(max_level):
        prefetch_string += lookup_name
        prefetch_list.append(prefetch_string)
    return prefetch_list
And then you can get the prefetch list with:
prefetch_list = get_mptt_prefetch(
    'category',
    # Only get categories which contain active products.
    related_model_qs=Category.objects.filter(product_set__active=True),
)
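And feed the result back into the product query (a sketch combining the pieces above):

products = Product.objects.filter(active=True).prefetch_related(
    'category__root',
    'category',
    *prefetch_list)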
https://django-mptt.readthedocs.io/en/latest/technical_details.html#level

Combine and flatten many key/value tuples into a single tuple in pig

I am using Pig 0.8.1. I am somewhat new to Pig but I know there must be a reasonable and re-usable solution for how I want to work with my tuples. I have the following format (similar to triples):
Schema: (uuid, key, val)
Data:
(id1, 'name', 'Corey')
(id1, 'location', 'USA')
(id1, 'carsOwned', 5)
(id2, 'name', 'Paul')
(id2, 'location', 'CANADA')
(id2, 'carsOwned', 10)
The reason I'm representing this data in triples is because it's possible to have multi-valued keys, so pushing the data into a map is out of the question.
What I need to be able to do is find the ids, names and locations of the people with the top 10 cars owned. I'd like it if my output format could be this when sorted in descending order:
Schema: (uuid, name, location, carsOwned)
Data:
(id2, 'Paul', 'CANADA', 10)
(id1, 'Corey', 'USA', 5)
I have tried filtering my input into 3 different aliases (one where key == 'name', one where key == 'location' and one where key == 'carsOwned') so that I can use JOIN to bring them back into one tuple, but it appears that Pig ends up reading the input 3 times instead of once. Maybe I'm doing that wrong?
I also tried grouping by id but then I can't seem to find a reasonable way to work with the bag of the actual triple key/values since they all have the exact same schema.
What I really want is to group by the id field and then flatten each of the keys but rename the alias to the actual name of the key.
Any ideas? Thanks in advance!
This solution is a bit sloppy, because your data is not organized in the way Pig is really set up for -- i.e., conceptually each id should be a row key, with the fields named by what you have in the second column. But this can still be done, as long as your data is all reasonable. If you erroneously wind up with multiple rows with the same id and field name, but different values, this will get complicated.
Use a nested foreach to pick out the values for the three fields you're interested in.
keyedByID =
    /* Gather the rows by ID, then process each one in turn */
    FOREACH (GROUP Data BY id) {
        /* Pull out the fields you want. If you have duplicate rows,
           you'll need to use a LIMIT statement to ensure just a single record */
        name = FILTER Data BY field == 'name';
        location = FILTER Data BY field == 'location';
        carsOwned = FILTER Data BY field == 'carsOwned';
        GENERATE
            /* Output each field you want. You'll need to use FLATTEN since
               the things created above in the nested foreach are bags. */
            group AS id,
            FLATTEN(name) AS name,
            FLATTEN(location) AS location,
            FLATTEN(carsOwned) AS carsOwned;
    };
Now you've got a relation that puts all the information for an ID on a single row, and you can do with it whatever you want. For example, you said you wanted to pull out the top 10 car owners:
ordered = ORDER keyedByID BY carsOwned DESC;
top10 = LIMIT ordered 10;

Sort by number of matches on queries based on m2m field

I hope the title is not misleading.
Anyway, I have two models, both have m2m relationships with a third model.
class Model1(models.Model):
    keywords = models.ManyToManyField(Keyword)
class Model2(models.Model):
    keywords = models.ManyToManyField(Keyword)
Given the keywords for a Model2 instance like this:
keywords2 = model2_instance.keywords.all()
I need to retrieve the Model1 instances which have at least a keyword that is in keywords2, something like:
Model1.objects.filter(keywords__in=keywords2)
and sort them by the number of keywords that match (I don't think that's possible via the 'in' field lookup). Question is, how do I do this?
I'm thinking of just manually iterating through the Model1 instances, appending each to a dictionary of results keyed by match count, but I need this to scale to, say, tens of thousands of records. Here is how I imagined it would look:
result = {}
keywords2_ids = model2.keywords.all().values_list('id', flat=True)
for model1 in Model1.objects.all():
    keywords_matched = model1.keywords.filter(id__in=keywords2_ids).count()
    result.setdefault(keywords_matched, []).append(model1)
There must be a faster way to do this. Any ideas?
You can just switch to raw SQL. What you have to do is write a custom manager for Model1 that returns the sorted set of Model1 ids based on the keyword match counts. The SQL is simple: join the two many-to-many tables (Django automatically creates a table to represent a many-to-many relationship) on keyword ids, then group on Model1 ids for the COUNT sql function. An ORDER BY clause on those counts then produces the sorted Model1 id list you need. In MySQL:
SELECT appname_model1_keywords.model1_id, COUNT(*) AS match_count
FROM appname_model1_keywords
JOIN appname_model2_keywords
  ON (appname_model1_keywords.keyword_id = appname_model2_keywords.keyword_id)
WHERE appname_model2_keywords.model2_id = model2_object_id
GROUP BY appname_model1_keywords.model1_id
ORDER BY match_count
Here model2_object_id is the model2_instance id. This will definitely be faster and more scalable.
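For reference, the same join/count/order can also be expressed in the Django ORM with annotate - a sketch, not part of the original answer, assuming the models above:

from django.db.models import Count

keyword_ids = model2_instance.keywords.values_list('id', flat=True)
matches = (
    Model1.objects
    .filter(keywords__in=keyword_ids)         # join restricted to shared keywords
    .annotate(match_count=Count('keywords'))  # counts only the matched join rows
    .order_by('-match_count')
)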

How do I get the related objects In an extra().values() call in Django?

Thanks to this post I'm able to easily do count and group by queries in a Django view:
Django equivalent for count and group by
What I'm doing in my app is displaying a list of coin types and face values available in my database for a country, so coins from the UK might have a face value of "1 farthing" or "6 pence". The face_value is the 6, the currency_type is the "pence", stored in a related table.
I have the following code in my view that gets me 90% of the way there:
def coins_by_country(request, country_name):
    country = Country.objects.get(name=country_name)
    coin_values = Collectible.objects.filter(country=country.id, type=1).extra(
        select={'count': 'count(1)'},
        order_by=['-count']
    ).values('count', 'face_value', 'currency_type')
    coin_values.query.group_by = ['currency_type_id', 'face_value']
    return render_to_response('icollectit/coins_by_country.html',
                              {'coin_values': coin_values, 'country': country})
The currency_type_id comes across as the number stored in the foreign key field (i.e. 4). What I want to do is retrieve the actual object that it references as part of the query (the Currency model, so I can get the Currency.name field in my template).
What's the best way to do that?
You can't do it with values(). But there's no need to use that - you can just get the actual Collectible objects, and each one will have a currency_type attribute that will be the relevant linked object.
And as justinhamade suggests, using select_related() will help to cut down the number of database queries.
Putting it together, you get:
coin_values = Collectible.objects.filter(
    country=country.id, type=1
).extra(
    select={'count': 'count(1)'},
    order_by=['-count']
).select_related()
select_related() got me pretty close, but it wanted me to add every field that I've selected to the group_by clause.
So I tried appending values() after the select_related(). No go. Then I tried various permutations of each in different positions of the query. Close, but not quite.
I ended up "wimping out" and just using raw SQL, since I already knew how to write the SQL query.
def coins_by_country(request, country_name):
    country = get_object_or_404(Country, name=country_name)
    cursor = connection.cursor()
    cursor.execute(
        'SELECT count(*), face_value, collection_currency.name '
        'FROM collection_collectible, collection_currency '
        'WHERE collection_collectible.currency_type_id = collection_currency.id '
        'AND country_id=%s AND type=1 '
        'GROUP BY face_value, collection_currency.name',
        [country.id])
    coin_values = cursor.fetchall()
    return render_to_response('icollectit/coins_by_country.html',
                              {'coin_values': coin_values, 'country': country})
If there's a way to phrase that exact query in the Django queryset language I'd be curious to know. I imagine that an SQL join with a count and grouping by two columns isn't super-rare, so I'd be surprised if there wasn't a clean way.
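For what it's worth, on later Django versions the usual ORM spelling of this grouping is values() followed by annotate() - a sketch against the models above, not something available when this was asked:

from django.db.models import Count

coin_values = (
    Collectible.objects
    .filter(country=country, type=1)
    .values('face_value', 'currency_type__name')  # the GROUP BY columns
    .annotate(count=Count('pk'))                  # count per group
    .order_by('-count')
)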
Have you tried select_related()? http://docs.djangoproject.com/en/dev/ref/models/querysets/#id4
I use it a lot and it seems to work well; then you can do coin_values.currency.name.
Also, I don't think you need country=country.id in your filter - just country=country - but I'm not sure what difference that makes other than less typing.

Django DB, finding Categories whose Items are all in a subset

I have two models:
class Category(models.Model):
    pass

class Item(models.Model):
    cat = models.ForeignKey(Category)
I am trying to return all Categories for which all of that category's items belong to a given subset of item ids. For example, all categories for which all of the items associated with that category have ids in the set [1, 3, 5].
How could this be done using Django's query syntax (as of 1.1 beta)? Ideally, all the work should be done in the database.
Category.objects.filter(item__id__in=[1, 3, 5])
Django creates the reverse relationship on the model without the foreign key. You can filter on it by using its related name (usually just the model name lowercased, but it can be manually overridden), two underscores, and the field name you want to query on.
Let's say you require all items to be in the following set:
allowable_items = set([1,3,4])
One brute-force solution would be to check the item_set for every category, like so:
categories_with_allowable_items = [
    category for category in Category.objects.all()
    if set([item.id for item in category.item_set.all()]) <= allowable_items
]
But we don't really have to check all categories, as categories_with_allowable_items is always going to be a subset of the categories related to the items whose ids are in allowable_items... so those are the only ones we have to check (and this should be faster):
categories_with_allowable_items = set([
    item.cat for item in
    Item.objects.select_related('cat').filter(pk__in=allowable_items)
    if set([siblingitem.id for siblingitem in item.cat.item_set.all()]) <= allowable_items
])
If performance isn't really an issue, then the latter of these two (if not the former) should be fine. If these are very large tables, you might have to come up with a more sophisticated solution. Also, if you're using a particularly old version of Python, remember that you'll have to import the sets module.
I've played around with this a bit. If QuerySet.extra() accepted a "having" parameter I think it would be possible to do it in the ORM with a bit of raw SQL in the HAVING clause. But it doesn't, so I think you'd have to write the whole query in raw SQL if you want the database doing the work.
EDIT:
This is the query that gets you part way there:
from django.db.models import Count
Category.objects.annotate(num_items=Count('item')).filter(num_items=...)
The problem is that for the query to work, "..." needs to be a correlated subquery that looks up, for each category, the number of its items in allowed_items. If .extra had a "having" argument, you'd do it like this:
Category.objects.annotate(num_items=Count('item')).extra(
    having="num_items = (SELECT COUNT(*) FROM app_item "
           "WHERE app_item.id IN %s AND app_item.cat_id = app_category.id)",
    having_params=[allowed_item_ids])
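As an aside, the "every item in the allowed subset" condition can be pushed entirely to the database by excluding any category that has an item outside the subset. A sketch, assuming the models above; note it also (vacuously) returns categories with no items at all:

allowed_item_ids = [1, 3, 5]
# Keep only categories with no item outside the allowed set.
categories = Category.objects.exclude(
    item__in=Item.objects.exclude(id__in=allowed_item_ids)
)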