Query excluding duplicates in Django - django

I'm using the QuerySet distinct() method to get some data in Django.
My initial query was Point.objects.order_by('chron', 'pubdate').
The chron field is sometimes duplicated, so I changed the query
to Point.objects.order_by('chron', 'pubdate').distinct('chron') in order to exclude duplicates.
Now the problem is that all empty (NULL) values are treated as duplicates of each other.
To be precise, the chron field contains integers (which behave similarly to ids); a value can be duplicated, and it can also be NULL.
| chron |
|-------|
| 1 | I want this
| 2 | I want this
| 3 | I want this
| 3 |
| NULL |
| 4 | I want this
| NULL |
I want to exclude duplicate chron values, except when the duplicated value is NULL.
Thank you.

Use two separate queries:
.exclude(chron__isnull=True).distinct("chron") for the unique non-NULL chron values.
.filter(chron__isnull=True) for the rows where chron is NULL.
Although this seems pretty inefficient, I believe (and will happily be corrected) that even a sensible vanilla SQL statement (e.g. the one below) would need multiple table scans to combine the NULL rows with the unique values.
SELECT *
FROM (
    SELECT chron
    FROM Point
    WHERE chron IS NOT NULL -- .exclude()
    GROUP BY chron -- .distinct()
    UNION ALL
    SELECT chron
    FROM Point
    WHERE chron IS NULL -- .filter()
) AS combined
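The combined query above can be tried out with Python's built-in sqlite3 module. This is only a sketch: the point table and the unique_or_null_chron helper are made up for illustration, and the SQL is run directly rather than through the Django ORM.

```python
import sqlite3


def unique_or_null_chron(values):
    """Collapse duplicate non-NULL chron values but keep every NULL row."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical one-column stand-in for the Point model's table
    conn.execute("CREATE TABLE point (chron INTEGER)")
    conn.executemany("INSERT INTO point (chron) VALUES (?)", [(v,) for v in values])
    sql = """
        SELECT chron FROM (
            SELECT chron FROM point WHERE chron IS NOT NULL GROUP BY chron
            UNION ALL
            SELECT chron FROM point WHERE chron IS NULL
        ) AS combined
    """
    result = [row[0] for row in conn.execute(sql)]
    conn.close()
    return result


chrons = unique_or_null_chron([1, 2, 3, 3, None, 4, None])
```

The row order of a UNION ALL is not guaranteed, so any ordering (for example by pubdate) would have to be applied to the combined result.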

Related

Efficient way of joining two query sets without foreign key

I know Django doesn't allow joining without a foreign key relation, and I can't specify a foreign key because some entries in one table are missing from the other (the tables are populated using pyspark). I need an efficient way to query the following:
Let's say I have the following tables:
Company | Product | Total # Users | Total # Unique Users
and
Company | Product | # Licenses | # Estimated Users
I would like to join such that I can display a table like this on the frontend
Company View
Product|Total # Users|Total # Unique Users|#Licenses|# Estimated Users|
P1 | Num | Num | Num | Num |
P2 | Num | Num | Num | Num |
Currently I loop through each product and perform one query per product to populate a dictionary of lists, which is far too slow and inefficient.
I'm not quite getting why you can't use a foreign key in this situation, but if you can express your query as a SQL statement, I would look at Q objects. See "Complex Lookups with Q Objects" in the documentation.
https://docs.djangoproject.com/en/2.2/topics/db/queries/#complex-lookups-with-q-objects
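Another option, offered only as a sketch and not as what the answer above proposes, is to fetch each table once and join the rows in Python on the (company, product) pair. The row dictionaries below stand in for what QuerySet.values() would return; the field names total_users and licenses and the helper name merge_on_company_product are assumptions made for illustration.

```python
def merge_on_company_product(usage_rows, license_rows):
    """Outer-join two lists of row dicts on the (company, product) pair."""
    merged = {}
    for row in list(usage_rows) + list(license_rows):
        key = (row["company"], row["product"])
        # Create the combined row on first sight, then fold in this row's columns
        merged.setdefault(key, {}).update(row)
    # Sort by key so the output order is deterministic
    return [merged[key] for key in sorted(merged)]


usage = [
    {"company": "Acme", "product": "P1", "total_users": 10},
    {"company": "Acme", "product": "P2", "total_users": 4},
]
licenses = [
    {"company": "Acme", "product": "P1", "licenses": 7},
    {"company": "Acme", "product": "P3", "licenses": 2},
]
combined = merge_on_company_product(usage, licenses)
```

This needs two queries in total instead of one per product. Rows that exist in only one table simply lack the other table's columns, so the frontend needs a default for them.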

Django: stop foreign key column on ManyToMany table from auto-ordering

I have a ManyToMany relationship between a Group model and a Source model:
class Group(models.Model):
    source = models.ManyToManyField('Source', null=True)
class Source(models.Model):
    content = models.CharField(max_length=8)
This creates an intermediate table with the columns id (PK), group_id (FK) and source_id (FK).
Source could look like this:
+----+----------+
| id | content |
+----+----------+
| 1 | A |
| 2 | B |
| 3 | C |
+----+----------+
Each group can have different Source members in a different order. For example, group 1 could have sources with 'content' C, A and B (keys 3, 1, 2 respectively), in that specific order.
Group 2 could have sources with 'content' B, C, A (keys 2, 3, 1 respectively), also in that specific order.
The table should look like this:
+----+----------+---------------+
| id | group_id | source_id |
+----+----------+---------------+
| 1 | 1 | 3 |
| 2 | 1 | 1 |
| 3 | 1 | 2 |
| 4 | 2 | 2 |
| 5 | 2 | 3 |
| 6 | 2 | 1 |
+----+----------+---------------+
The trouble is when I associate these sources in the order I want in a for loop:
sequences = [['C', 'A', 'B'], ['B', 'C', 'A']]
for seq in sequences:
    group = models.Group()
    group.save()
    for letter in seq:
        source = models.Source.objects.get(content=letter)
        source.group_set.add(group)
The rows end up in the table re-ordered sequentially by source_id, which is definitely not what I want, because in this case the order of the Sources is essential.
+----+----------+---------------+
| id | group_id | source_id |
+----+----------+---------------+
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
| 4 | 2 | 1 |
| 5 | 2 | 2 |
| 6 | 2 | 3 |
+----+----------+---------------+
How can I avoid this column re-ordering in Django?
It's important to understand that in SQL there isn't an inherent ordering to the table; the way the information is stored is opaque to you. Rather, the results of each query are ordered according to some specification that you provide at query time.
It sounds like you want the primary key of the M2M table to do double-duty as the field that defines the ordering. In most use cases that is a bad idea. What if you decide later to switch the order of A and B in group 1? What if you need to insert a new Source in between them? You can't do it, because primary keys are not that flexible.
The usual way to do this is to provide a specific column just for ordering. Unlike the primary key field you can change this at will, allowing you to adjust the order, insert new items, etc. In Django you would do this by explicitly declaring the M2M table (using the through argument) and adding an ordering column to it. Something like:
from django.db import models
class Group(models.Model):
    source = models.ManyToManyField('Source', through='GroupSource')
class Source(models.Model):
    content = models.CharField(max_length=8)
class GroupSource(models.Model):
    # Also look into using unique_together for this model
    # on_delete is a required argument from Django 2.0 onward
    group = models.ForeignKey(Group, on_delete=models.CASCADE)
    source = models.ForeignKey(Source, on_delete=models.CASCADE)
    position = models.IntegerField()
And your code would change to:
sequences = [['C', 'A', 'B'], ['B', 'C', 'A']]
for seq in sequences:
    group = models.Group()
    group.save()
    for position, letter in enumerate(seq):
        source = models.Source.objects.get(content=letter)
        # Record the position explicitly instead of relying on insertion order
        models.GroupSource.objects.create(group=group, source=source, position=position)
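The point that ordering is specified at query time, using a dedicated column, can be sketched with Python's built-in sqlite3 module. The group_source table mirrors the through table described above, but the table name, the sources_in_order helper and the sample data are assumptions for illustration; this is plain SQL, not the Django ORM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE group_source ("
    "id INTEGER PRIMARY KEY, group_id INTEGER, source_id INTEGER, position INTEGER)"
)

# Group 1 wants sources 3, 1, 2 and group 2 wants 2, 3, 1, in that order
memberships = {1: [3, 1, 2], 2: [2, 3, 1]}
for group_id, source_ids in memberships.items():
    for position, source_id in enumerate(source_ids):
        conn.execute(
            "INSERT INTO group_source (group_id, source_id, position) VALUES (?, ?, ?)",
            (group_id, source_id, position),
        )


def sources_in_order(group_id):
    """Return a group's source ids sorted by the explicit position column."""
    rows = conn.execute(
        "SELECT source_id FROM group_source WHERE group_id = ? ORDER BY position",
        (group_id,),
    )
    return [row[0] for row in rows]
```

Because the order lives in position rather than in the primary key, swapping two members later is just an UPDATE of that column.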
Thanks for taking the time and effort. I probably would have gone down much the same route by adding another field to represent the ordering. But if you can safely get the same thing for free, why bother? These were individual inserts whose insertion order is important. What puzzled me most were the results of some tests I have just finished.
I managed to keep the foreign keys ordered the way I inserted them by using sql-connector on a test db with the same schema relationships between the tables. There, the keys in the intermediary table that point to each of the ManyToMany partners are not re-organised from lowest to highest. However, the exact same code still re-ordered them on the problematic database. Hence it was not a Django thing as such.
The only real difference between the working and non-working tables was the UNIQUE attribute on the columns pointing to the ManyToMany partners, i.e. the foreign keys to Group and Source. After removing it, the problem went away.
However, to be honest, I am not sure why, or why Django put those UNIQUE attributes there in the first place. I am also not sure whether removing them will badly affect the application going forward.

Django: duplicates when filtering on many to many field

I've got the following models in my Django app:
class Book(models.Model):
    name = models.CharField(max_length=100)
    keywords = models.ManyToManyField('Keyword')
class Keyword(models.Model):
    name = models.CharField(max_length=100)
I've got the following keywords saved:
science-fiction
fiction
history
science
astronomy
On my site a user can filter books by keyword, by visiting /keyword-slug/. The keyword_slug variable is passed to a function in my views, which filters Books by keyword as follows:
def get_books_by_keyword(keyword_slug):
    books = Book.objects.all()
    keywords = keyword_slug.split('-')
    for k in keywords:
        books = books.filter(keywords__name__icontains=k)
    return books
This works for the most part. However, whenever I filter with a keyword that contains a string appearing more than once in the keywords table (e.g. science-fiction and fiction), the same book appears more than once in the resulting QuerySet.
I know I can add distinct() to return only unique books, but I'm wondering why I'm getting duplicates to begin with, and I really want to understand why it works this way. Since I'm only calling filter() on successively filtered QuerySets, how does the duplicate book get added to the results?
The two models in your example are represented by three tables: book, keyword, and a book_keyword relation table that manages the M2M field.
When you use keywords__name in a filter() call, Django uses SQL JOINs to combine all three tables. This allows you to filter objects in the first table by values from another table.
The SQL will be like this:
SELECT `book`.`id`,
       `book`.`name`
FROM `book`
INNER JOIN `book_keyword` ON (`book`.`id` = `book_keyword`.`book_id`)
INNER JOIN `keyword` ON (`book_keyword`.`keyword_id` = `keyword`.`id`)
WHERE (`keyword`.`name` LIKE '%fiction%')
After the JOIN, your data looks like this (book table columns first, then the relation table, then the keyword table):
| Book ID | Book name | relation_book_id | relation_key_id | Keyword ID | Keyword name |
|---------|-----------|------------------|-----------------|------------|-----------------|
| 1 | Book 1 | 1 | 1 | 1 | Science-fiction |
| 1 | Book 1 | 1 | 2 | 2 | Fiction |
| 2 | Book 2 | 2 | 2 | 2 | Fiction |
Then, when the data is loaded from the DB into Python, you only receive the columns from the book table. As you can see, Book 1 is duplicated there.
This is how many-to-many relations and JOINs work.
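The duplication can be reproduced outside Django with Python's built-in sqlite3 module. This is a sketch under assumptions: the table names follow the answer's simplified naming rather than Django's real generated names, and matching_books is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE keyword (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book_keyword (book_id INTEGER, keyword_id INTEGER);
    INSERT INTO book VALUES (1, 'Book 1'), (2, 'Book 2');
    INSERT INTO keyword VALUES (1, 'science-fiction'), (2, 'fiction');
    INSERT INTO book_keyword VALUES (1, 1), (1, 2), (2, 2);
""")

JOIN_SQL = """
    SELECT {distinct} book.id, book.name
    FROM book
    INNER JOIN book_keyword ON book.id = book_keyword.book_id
    INNER JOIN keyword ON book_keyword.keyword_id = keyword.id
    WHERE keyword.name LIKE '%fiction%'
    ORDER BY book.id
"""


def matching_books(distinct=False):
    """Run the join, optionally collapsing identical book rows with DISTINCT."""
    sql = JOIN_SQL.format(distinct="DISTINCT" if distinct else "")
    return conn.execute(sql).fetchall()
```

Book 1 is linked to two keywords that both match, so the join produces one row per matching link; DISTINCT collapses them, which is what distinct() does on the QuerySet.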
Direct quote from the Docs: https://docs.djangoproject.com/en/dev/topics/db/queries/#spanning-multi-valued-relationships
Successive filter() calls further restrict the
set of objects, but for multi-valued relations, they apply to any
object linked to the primary model, not necessarily those objects that
were selected by an earlier filter() call.
In your case, because keywords is a multi-valued relation, your chain of .filter() calls filters based only on the original model and not on the previous queryset.
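For a slug such as science-fiction, the loop in the question chains two filter() calls, and each call on a multi-valued relation gets its own pair of JOINs. The sketch below approximates the resulting SQL with Python's built-in sqlite3 module; the table names, the aliases bk1/k1/bk2/k2 and the chained_filter_rows helper are illustrative assumptions, not the exact SQL Django emits.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE keyword (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book_keyword (book_id INTEGER, keyword_id INTEGER);
    INSERT INTO book VALUES (1, 'Book 1'), (2, 'Book 2');
    INSERT INTO keyword VALUES (1, 'science-fiction'), (2, 'fiction');
    INSERT INTO book_keyword VALUES (1, 1), (1, 2), (2, 2);
""")


def chained_filter_rows():
    """Approximate two chained filter() calls: one independent join per call."""
    sql = """
        SELECT book.id, book.name
        FROM book
        INNER JOIN book_keyword bk1 ON book.id = bk1.book_id
        INNER JOIN keyword k1 ON bk1.keyword_id = k1.id
        INNER JOIN book_keyword bk2 ON book.id = bk2.book_id
        INNER JOIN keyword k2 ON bk2.keyword_id = k2.id
        WHERE k1.name LIKE '%science%' AND k2.name LIKE '%fiction%'
        ORDER BY book.id
    """
    return db.execute(sql).fetchall()
```

Here 'science' matches one of Book 1's keywords and 'fiction' matches two, so Book 1 comes back 1 x 2 = 2 times, while Book 2 has no keyword containing 'science' and drops out.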

Ordering entries via comment count with django

I need to get entries from the database along with their comment counts. Can I do this with Django's comment framework? I am also using a voting application that does not use GenericForeignKeys, and I get entries with scores like this:
from django.db import models
from django.db.models import Sum
class EntryManager(models.Manager):
    def get_queryset(self):
        # Annotate each entry with the sum of its vote values
        return super(EntryManager, self).get_queryset().annotate(
            score=Sum("linkvote__value"))
But when foreign keys are involved, I get stuck. Do you have any ideas about that?
Extra explanation: I need to fetch entries like this:
| id | body | vote_score | comment_score |
|----|------|------------|---------------|
| 1 | foo | 13 | 4 |
| 2 | bar | 4 | 1 |
After that, I can order them by comment_score. :)
Thanks for all replies.
Apparently, annotating with reverse generic relations (or extra filters, in general) is still an open ticket (see also the corresponding documentation). Until this is resolved, I would suggest using raw SQL in an extra query, like this:
return super(EntryManager, self).get_queryset().annotate(
    vote_score=Sum("linkvote__value")).extra(select={
        # Correlated subquery: count the comments attached to each entry
        'comment_score': """SELECT COUNT(*) FROM comments_comment
            WHERE comments_comment.object_pk = yourapp_entry.id
            AND comments_comment.content_type_id = %s"""
    }, select_params=(entry_type.id,))
Of course, you have to fill in the correct table names. Furthermore, entry_type is a "constant" that can be set outside your lookup function (see ContentTypeManager):
from django.contrib.contenttypes.models import ContentType
entry_type = ContentType.objects.get_for_model(Entry)
This is assuming you have a single model Entry that you want to calculate your scores on. Otherwise, things would get slightly more complicated: you would need a sub-query to fetch the content type id for the type of each annotated object.
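The correlated subquery behind comment_score can be tried in isolation with Python's built-in sqlite3 module. Everything here is a stand-in chosen for illustration: the entry and comment tables, the content type id 7, and the entries_by_comment_count helper are assumptions, not the real tables of the comments framework.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE comment (
        id INTEGER PRIMARY KEY, content_type_id INTEGER, object_pk TEXT);
    INSERT INTO entry VALUES (1, 'foo'), (2, 'bar');
    INSERT INTO comment (content_type_id, object_pk) VALUES
        (7, '1'), (7, '1'), (7, '1'), (7, '1'), (7, '2'), (9, '2');
""")

ENTRY_TYPE_ID = 7  # hypothetical content type id for the Entry model


def entries_by_comment_count(content_type_id):
    """Return (id, body, comment_score) rows, most-commented entries first."""
    sql = """
        SELECT entry.id, entry.body,
               (SELECT COUNT(*) FROM comment
                WHERE comment.object_pk = CAST(entry.id AS TEXT)
                  AND comment.content_type_id = ?) AS comment_score
        FROM entry
        ORDER BY comment_score DESC
    """
    return conn.execute(sql, (content_type_id,)).fetchall()
```

The comment with content type 9 also points at object '2' but is not counted, which is why the content type filter matters when several models share one comments table.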

Django - ORM question

Just wondering if it is possible to get a result that I can get using this SQL query with only Django ORM:
SELECT * FROM (SELECT DATE_FORMAT(created, "%Y") as dte, sum(1) FROM some_table GROUP BY dte) as analytics;
The result is:
+------+--------+
| dte | sum(1) |
+------+--------+
| 2006 | 20 |
| 2007 | 2230 |
| 2008 | 4929 |
| 2009 | 1177 |
+------+--------+
The simplified model looks like this:
# some/models.py
import datetime
from django.db import models
class Table(models.Model):
    created = models.DateTimeField(default=datetime.datetime.now)
I've tried various approaches using a mix of .extra(select={}) and .values(), and also the .query.group_by trick described here, but would appreciate a fresh pair of eyes on the problem.
Django 1.1 (trunk at the time of posting this) has aggregates, which allow you to perform counts, mins, sums, averages, etc. in your queries.
What you're looking to do would probably be accomplished using multiple querysets. Remember, each row in a table (even a generated results table) is supposed to be a new object. You don't really explain what you're summing so I'll consider it dollars:
from django.db.models import Sum
books = Book.objects.all().order_by('created')
# Collect each distinct year once, using a set comprehension
years = sorted({book.created.year for book in books})
for year in years:
    sum_for_year = Book.objects.filter(created__year=year).aggregate(Sum('sales'))
When you need a query that Django doesn't let you express through the ORM, you can always use raw SQL.
For your immediate purpose, I'm thinking that grouping by an expression (and doing an aggregate calculation on the group) is beyond Django's current capabilities.
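For the raw SQL route, the per-year grouping from the question can be sketched with Python's built-in sqlite3 module. Note the assumptions: SQLite's strftime('%Y', ...) stands in for MySQL's DATE_FORMAT, COUNT(*) replaces sum(1), and rows_per_year is a hypothetical helper, not a Django API.

```python
import sqlite3


def rows_per_year(timestamps):
    """Count rows per calendar year of their 'created' timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE some_table (created TEXT)")
    conn.executemany(
        "INSERT INTO some_table (created) VALUES (?)", [(t,) for t in timestamps]
    )
    sql = """
        SELECT strftime('%Y', created) AS dte, COUNT(*)
        FROM some_table
        GROUP BY dte
        ORDER BY dte
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


counts = rows_per_year([
    "2006-01-05 10:00:00",
    "2007-03-01 00:00:00",
    "2007-12-31 23:59:59",
])
```

The year comes back as a string because strftime returns text; cast it if an integer is needed.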