Django bulk update with data two tables over

Django bulk update with data two tables over - django

I want to bulk update a table with data two tables over. A solution has been given for the simpler case mentioned in the documentation of:
Entry.objects.update(headline=F('blog__name'))
For that solution, see
https://stackoverflow.com/a/50561753/1092940
Expanding from the example, imagine that Entry has a Foreign Key reference to Blog via a field named blog, and that Blog has a Foreign Key reference to User via a field named author. I want the equivalent of:
Entry.objects.update(author_name=F('blog__author__username'))
As in the prior solution, the solution is expected to employ SubQuery and OuterRef.
The reason I ask here is because I lack confidence where this problem starts to employ multiple OuterRefs, and confusion arises about which outer ref it refers to.

The reason I ask here is because I lack confidence where this problem starts to employ multiple OuterRefs, and confusion arises about which outer ref it refers to.
It does not require multiple outer references, you can update with:
from django.db.models import OuterRef, Subquery
author_name = Author.objects.filter(
blogs__id=OuterRef('blog_id')
).values_list(
'username'
)[:1]
Entry.objects.update(
author_name=Subquery(author_name)
)
You here thus specify that you look for an Author with a related Blog with an id equal to the blog_id of the Entry.

Related

How could I translate a spatial JOIN into Django query language?

Context
I have two tables app_area and app_point that are not related in any way (no Foreign Keys) expect each has a geometry field so we could spatially query them, respectively of polygon and point type. The bare model looks like:
from django.contrib.gis.db import models
class Point(models.Model):
# ...
geom = models.PointField(srid=4326)
class Area(models.Model):
# ...
geom = models.PolygonField(srid=4326)
I would like to create a query which filters out points that are not contained in polygon.
If I had to write it with a Postgis/SQL statement to perform this task I would issue this kind of query:
SELECT
P.*
FROM
app_area AS A JOIN app_point AS P ON ST_Contains(A.geom, P.geom);
Which is simple and efficient when spatial indices are defined.
My concern is to write this query without hard coded SQL in my Django application. Therefore, I would like to delegate it to the ORM using the classical Django query syntax.
Issue
I could not find a clear example of this kind of query on the internet, solutions I have found:
Either rely on predefined relation using ForeignKeyField or prefetch_related (but this relation does not exist in my case);
Or use a single hand crafted geometry to represent the polygon (but this is not my use case as I want to rely on another table as the polygon source).
I have the feeling that is definitely achievable with Django but maybe I am too new to this framework or it is not enough documented or I have not found the rightful keywords set to google it.
The best I could find in the official documentation is the FilteredRelation object which seems to do what I want: defining the ON part of the JOIN clause, but I could not setup properly, mainly I don't understand how to fill the other table and point to proper fields.
from django.db.models import F, Q, FilteredRelation
query = Location.objects.annotate(
campus=FilteredRelation(<relation_name>, condition=Q(geom__contains=F("geom")))
)
Mainly the field relation_name puzzle me. I would expect it to be the table I would join (here Area) on but it seems it is a column name which is expected.
django.core.exceptions.FieldError: Cannot resolve keyword 'Area' into field. Choices are: created, geom, id, ...
But this list of fields are from Point table.
My question is: How could I translate my spatial JOIN into Django query language?
Nota: There is no requirement to rely on FilteredRelation object, this is just the best match I have found for now!
Update
I am able to emulate the expected output using extra:
results = models.Point.objects.extra(
where=["ST_intersects(app_area.geom, app_point.geom)"],
tables=["app_area"]
)
Which returns a QuerySet but it still needs to inject plain SQL statement and then the generated SQL are not equivalent in term of clauses:
SELECT "app_point"."id", "app_point"."geom"::bytea
FROM "app_point", "app_area"
WHERE (ST_intersects(app_area.geom, app_point.geom))
And EXPLAIN performances.

I think, the best solution would be to aggregate the areas and then do an intersection with the points.
from django.db.models import Q
from django.contrib.gis.db.models.aggregates import Union
multipolygon_area = Area.objects.aggregate(area=Union("geom"))["area"]
# Get all points inside areas
Points.objects.filter(geom__intersects=multipolygon_area)
# Get all points outside areas
Points.objects.filter(~Q(geom__intersects=multipolygon_area))
This is quite efficient as it is completely calculated on the database level.
The idea was found here

How to annotate a distinct Count over multiple relationships in Django?

Given a model that has more than one kind of connection to a related model (I will call the "parent" model), how could I annotate a queryset with a count of parent model objects that are linked through either connection without counting duplicates?
Example model definitions
Consider an Article model that has 2 links to a parent Publication model that are very similar in meaning.
from django.db import models
class Publication(models.Model):
pass
class Article(models.Model):
publication = models.ForeignKey(Publication, related_name='publications')
owner = models.ForeignKey(Publication, related_name='owned_articles')
Objective
I want to serve a page that is a list of publications. A business requirement is that the number of articles that the publication wishes to take credit for are shown (these publications prefer a generous metric for counting). An article is considered part of the organization if either the "owner" or "publication" field points to it, but no articles should be counted more than once for a single publication. An article may be included in the count of 2 publications if publication points to a different object than owner.
I don't want to execute a query for every publication in the list.
The problem with Count annotations here
Publication.objects.annotate(Count('publications'), Count('owned_articles')) would be trivial. Then I will have count__publications and count__owned_articles.
My problem is that I can't tell how many articles in count__publications were also counted in count__owned_articles. Django doesn't allow me to cram a full queryset into Count, so in this general case of needing extra control of what is counted a special mechanism is needed.
Similar questions
I have found this situation most similar to the question here:
Django annotate count with a distinct field
You could contrive this same general situation by intensifying that question's request by adding another related model in addition to InformationUnit and asking for a count of unique usernames among both related models.

(initial answer, answering my own question with a so-so solution)
The preferable approach would be to start with a Publication queryset, however, I can manage to squeeze a solution out of the Django ORM by pivoting around the Article queryset instead.
Consider as a solution to this problem:
exclusive_owners_qs = Article.objects.exclude(
publication=F('owner')
).annotate(Count('publication')).order_by('publication')
publications_qs = Article.objects.annotate(Count('owner')).order_by('owner')
With this, I can loop over the two querysets and add up the 2 numbers locally inside of python to get the correct counts.
This satisfies the requirements, but it's also not an elegant solution. Eliminating the need for a python loop would be ideal.

I believe the correct answer is using Count("publications", distinct=True), as described here:
https://docs.djangoproject.com/en/3.2/topics/db/aggregation/#combining-multiple-aggregations

Do Django models really need a single unique key field

Some of my models are only unique in a combination of keys. I don't want to use an auto-numbering id as the identifier as subsets of the data will be exported to other systems (such as spreadsheets), modified and then used to update the master database.
Here's an example:
class Statement(models.Model):
supplier = models.ForeignKey(Supplier)
total = models.DecimalField("statement total", max_digits=10, decimal_places=2)
statement_date = models.DateField("statement date")
....
class Invoice(models.Model):
supplier = models.ForeignKey(Supplier)
amount = models.DecimalField("invoice total", max_digits=10, decimal_places=2)
invoice_date = models.DateField("date of invoice")
statement = models.ForeignKey(Statement, blank=True, null=True)
....
Invoice records are only unique for a combination of supplier, amount and invoice_date
I'm wondering if I should create a slug for Invoice based on supplier, amount and invoice_date so that it is easy to identify the correct record.
An example of the problem of having multiple related fields to identify the right record is django-csvimport which assumes there is only one related field and will not discriminate on two when building the foreign key links.
Yet the slug seems a clumsy option and needs some kind of management to rebuild the slugs after adding records in bulk.
I'm thinking this must be a common problem and maybe there's a best practice design pattern out there somewhere.
I am using PostgreSQL in case anyone has a database solution. Although I'd prefer to avoid that if possible, I can see that it might be the way to build my slug if that's the way to go, perhaps with trigger functions. That just feels a bit like hidden functionality though, and may cause a headache for setting up on a different server.
UPDATE - after reading initial replies
My application requires that data may be exported, modified remotely, and merged back into the master database after review and approval. Hidden autonumber keys don't easily survive that consistently. The relation invoices[2417] is part of statements[265] is not persistent if the statement table was emptied and reloaded from a CSV.
If I use the numeric autonumber pk then any process that is updating the database would need to refresh the related key numbers or by using the multiple WITH clause.
If I create a slug that is based on my 3 keys but easy to reproduce then I can use it as the key - albeit clumsily. I'm thinking of a slug along the lines:
u'%s %s %s' % (self.supplier,
self.statement_date.strftime("%Y-%m-%d"),
self.total)
This seems quite clumsy and not very DRY as I expect I may have to recreate the slug elsewhere duplicating the algorithm (maybe in an Excel formula, or an Access query)
I thought there must be a better way I'm missing but it looks like yuvi's reply means there should be, and there will be, but not yet :-(

What you're talking about it a multi-column primary key, otherwise known as "composite" or "compound" keys. Support in django for composite keys today is still in the works, you can read about it here:
Currently Django models only support a single column in this set,
denying many designs where the natural primary key of a table is
multiple columns [...] Current state is that the issue is
accepted/assigned and being worked on [...]
The link also mentions a partial implementation which is django-compositekeys. It's only partial and will cause you trouble with navigating between relationships:
support for composite keys is missing in ForeignKey and
RelatedManager. As a consequence, it isn't possible to navigate
relationships from models that have a composite primary key.
So currently it isn't entirely supported, but will be in the future. Regarding your own project, you can make of that what you will, though my own suggestion is to stick with the fully supported default of a hidden auto-incremented field that you don't even need to think about (and use unique_together to enforce the uniqness of the described fields instead of making them your primary keys).
I hope this helps!

No.
Model needs to have one field that is primary_key = True. By default this is the (hidden) autofield which stores object Id. But you can set primary_key to True at any other field.
I've done this in cases, Where i'm creating django project upon tables which were previously created manually or through some other frameworks/systems.
In reality - you can use whatever means you can think of, for joining objects together in queries. As long as query returns bunch of data that can be associated with models you have - it does not really matter which field you are using for joins. Just keep in mind, that the solution you use should be as effective as possible.
Alan

Django .order_by() with .distinct() using postgres

I have a Read model that is related to an Article model. What I would like to do is make a queryset where articles are unique and ordered by date_added. Since I'm using postgres, I'd prefer to use the .distinct() method and specify the article field. Like so:
articles = Read.objects.order_by('article', 'date_added').distinct('article')
However this doesn't give the desired effect and orders the queryset by the order they were created. I am aware of the note about .distinct() and .order_by() in Django's documentation, but I don't see that it applies here since the side effect it mentions is there will be duplicates and I'm not seeing that.
# To actually sort by date added I end up doing this
articles = sorted(articles, key=lambda x: x.date_added, reverse=True)
This executes the entire query before I actually need it and could potentially get very slow if there are lots of records. I've already optimized using select_related().
Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
UPDATE
The output would ideally be a queryset of Read instances where their related article is unique in the queryset and only using the Django orm (i.e. sorting in python).

Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
Possibily. It's hard to say without the full picture, but my assumption is that you are using Read to track which articles have and have not been read, and probably tying this to User instance to determine if a particular user has read an article or not. If that's the case, your approach is flawed. Instead, you should do something like:
class Article(models.Model):
...
read_by = models.ManyToManyField(User, related_name='read_articles')
Then, to get a particular user's read articles, you can just do:
user_instance.read_articles.order_by('date_added')
That takes the need to use distinct out of the equation, since there will not be any duplicates now.
UPDATE
To get all articles that are read by at least one user:
Article.objects.filter(read_by__isnull=False)
Or, if you want to set a threshold for popularity, you can use annotations:
from django.db.models import Count
Article.objects.annotate(read_count=Count('read_by')).filter(read_count__gte=10)
Which would give you only articles that have been read by at least 10 users.

Django: Query with F() into an object not behaving as expected

I am trying to navigate into the Price model to compare prices, but met with an unexpected result.
My model:
class ProfitableBooks(models.Model):
price = models.ForeignKey('Price',primary_key=True)
In my view:
foo = ProfitableBooks.objects.filter(price__buy__gte=F('price__sell'))
Producing this error:
'ProfitableBooks' object has no attribute 'sell'

Is this your actual model or a simplification? I think the problem may lie in having a model whose only field is its primary key is a foreign key. If I try to parse that out, it seems to imply that it's essentially a field acting as a proxy for a queryset-- you could never have more profitable books than prices because of the nature of primary keys. It also would seem to mean that your elided books field must have no overlap in prices due to the implied uniqueness constraints.
If I understand correctly, you're trying to compare two values in another model: price.buy vs. price.sell, and you want to know if this unpictured Book model is profitable or not. While I'm not sure exactly how the F() object breaks down here, my intuition is that F() is intended to facilitate a kind of efficient querying and updating where you're comparing or adjusting a model value based on another value in the database. It may not be equipped to deal with a 'shell' model like this which has no fields except a joint primary/foreign key and a comparison of two values both external to the model from which the query is conducted (and also distinct from the Book model which has the identifying info about books, I presume).
The documentation says you can use a join in an F() object as long as you are filtering and not updating, and I assume your price model has a buy and sell field, so it seems to qualify. So I'm not 100% sure where this breaks down behind the scenes. But from a practical perspective, if you want to accomplish exactly the result implied here, you could just do a simple query on your price model, b/c again, there's no distinct data in the ProfitableBooks model (it only returns prices), and you're also implying that each price.buy and price.sell have exactly one corresponding book. So Price.objects.filter(buy__gte=F('sell')) gives the result you've requested in your snipped.
If you want to get results which are book objects, you should do a query like the one you've got here, but start from your Book model instead. You could put that query in a queryset manager called "profitable_books" or something, if you wanted to substantiate it in some way.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js