How to Remove Duplicate Values after Merging Different Model Querysets in Django

How do I remove duplicates after merging querysets from two different models, like this?
import itertools
events_list = list(itertools.chain(events_list, speakers_list))
I am getting duplicate values in my Django REST serializer.

You can use union(); here are the docs on union():
qs1.union(qs2)
# no duplicates
By default, union() returns only distinct values. If you want to keep duplicates, use:
qs1.union(qs2, all=True)
# allow duplicates
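If the two querysets come from different models, as in the question, union() still works provided the selected columns match; a minimal sketch, assuming hypothetical Event and Speaker models that both have a name field:
# hypothetical models; union() needs the same number and types of columns
events = Event.objects.values("id", "name")
speakers = Speaker.objects.values("id", "name")
combined = events.union(speakers)  # distinct rows by default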

A set enforces uniqueness:
events_list = set(itertools.chain(events_list, speakers_list))
Note that a set is unordered, and that instances of two different models never compare equal, so this only removes duplicates within each model.

a = queryset1
b = queryset2
qs = a | b  # removes duplicates (works only when both querysets are on the same model)

distinct() does this job:
queryset_final = queryset_with_duplicate.distinct()
From the docs:
By default, a QuerySet will not eliminate duplicate rows. In practice,
this is rarely a problem, because simple queries such as
Blog.objects.all() don’t introduce the possibility of duplicate result
rows. However, if your query spans multiple tables, it’s possible to
get duplicate results when a QuerySet is evaluated. That’s when you’d
use distinct().
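For example, with the Blog/Entry models from the docs, a filter that spans a multi-valued relation returns one row per matching entry, and distinct() collapses them:
# each blog would otherwise appear once per matching entry
blogs = Blog.objects.filter(entry__headline__contains="Lennon").distinct()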

Related

Django filter returns more rows than its source queryset

print("Step 1",invs.count()) # -> 1000 # invs type: query
invs2 = invs.filter(field___fields2__fields3=i) # i type:int
print("Step 2",invs2.count()) # -> 40000
Is it normal for the filter function to return more than its origin ?
Thank you.
Yes, there's an entire section in the docs that explains it:
Lookups that span relationships
Inside the big green note block below the "Spanning multi-valued relationships" heading, it states:
However, unlike the behavior when using filter(), this will not limit blogs based on entries that satisfy both conditions. In order to do that, i.e. to select all blogs that do not contain entries published with “Lennon” that were published in 2008, you need to make two queries:
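The docs' two-query form looks roughly like this (Blog and Entry are the example models from the docs):
Blog.objects.exclude(
    entry__in=Entry.objects.filter(
        headline__contains="Lennon",
        pub_date__year=2008,
    ),
)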
The relevant information can be found in the description of distinct().
To quote:
Returns a new QuerySet that uses SELECT DISTINCT in its SQL query. This eliminates duplicate rows from the query results.
By default, a QuerySet will not eliminate duplicate rows. In practice, this is rarely a problem, because simple queries such as Blog.objects.all() don’t introduce the possibility of duplicate result rows. However, if your query spans multiple tables, it’s possible to get duplicate results when a QuerySet is evaluated. That’s when you’d use distinct().
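Applied to the query from the question, a minimal sketch (field names taken from the question):
# collapse the duplicate rows introduced by the multi-valued join
invs2 = invs.filter(field__fields2__fields3=i).distinct()
print("Step 2", invs2.count())  # at most the original 1000 now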

Remove duplicates in Django ORM -- multiple rows

I have a model that has four fields. How do I remove duplicate objects from my database?
Daniel Roseman's answer to this question seems appropriate, but I'm not sure how to extend this to situation where there are four fields to compare per object.
Thanks,
W.
from django.db import models

def remove_duplicated_records(model, fields):
    """
    Removes records from `model` duplicated on `fields`,
    keeping the most recent one (biggest `id`).
    """
    duplicates = model.objects.values(*fields)

    # override any model-specific ordering (required for `.annotate()`)
    duplicates = duplicates.order_by()

    # group by identical values of `fields`; count how many rows share them
    duplicates = duplicates.annotate(
        max_id=models.Max("id"), count_id=models.Count("id")
    )

    # keep only the groups that are actually duplicated
    duplicates = duplicates.filter(count_id__gt=1)

    for duplicate in duplicates:
        to_delete = model.objects.filter(**{x: duplicate[x] for x in fields})
        # spare the latest duplicated record
        # (use `Min` if you wish to keep the first record instead)
        to_delete = to_delete.exclude(id=duplicate["max_id"])
        to_delete.delete()
You shouldn't need to do this often; use a unique_together constraint at the database level instead.
This keeps the record with the biggest id in the DB. If you want to keep the original record (the first one), modify the code a bit with models.Min. You can also use a completely different field, like a creation date.
Underlying SQL
When annotating, the Django ORM adds a GROUP BY clause over all model fields used in the query, hence the use of the .values() method. GROUP BY groups all records having identical values in those fields. The duplicated groups (more than one id per unique combination of fields) are then filtered out by the HAVING clause that .filter() generates on the annotated QuerySet.
SELECT
    field_1,
    …
    field_n,
    MAX(id) AS max_id,
    COUNT(id) AS count_id
FROM
    app_mymodel
GROUP BY
    field_1,
    …
    field_n
HAVING
    count_id > 1
The duplicated records are then deleted in the for loop, sparing the most recent one (highest id) in each group.
Empty .order_by()
Just to be sure, it's always wise to add an empty .order_by() call before aggregating a QuerySet.
The fields used for ordering the QuerySet are also included in the GROUP BY statement. An empty .order_by() overrides the columns declared in the model's Meta, so they are not included in the SQL query (a default sort by date, for example, could ruin the results).
You might not need to override it at the moment, but someone might add default ordering later and thereby break your precious delete-duplicates code without even knowing it. Yes, I'm sure you have 100% test coverage…
Just add empty .order_by() to be safe. ;-)
https://docs.djangoproject.com/en/3.2/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
Transaction
Of course you should consider doing it all in a single transaction.
https://docs.djangoproject.com/en/3.2/topics/db/transactions/#django.db.transaction.atomic
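A minimal sketch of wrapping the cleanup, reusing the remove_duplicated_records() helper from above (model and field names are placeholders):
from django.db import transaction

# run the whole cleanup atomically; roll back on any error
with transaction.atomic():
    remove_duplicated_records(MyModel, ["field_1", "field_2"])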
If you want to delete duplicates on single or multiple columns, you don't need to iterate over millions of records.
Fetch all unique columns (don't forget to include the primary key column)
fetch = Model.objects.all().values("id", "skuid", "review", "date_time")
Read the result using pandas (I used pandas here instead of an ORM query):
import pandas as pd
df = pd.DataFrame.from_dict(fetch)
Drop duplicates on unique columns
uniq_df = df.drop_duplicates(subset=["skuid", "review", "date_time"])
# don't include the primary key in the subset
Now you'll have the unique records, from which you can pick the primary keys:
primary_keys = uniq_df["id"].tolist()
Finally, it's showtime: exclude those ids from the records and delete the rest of the data:
records = Model.objects.all().exclude(pk__in=primary_keys).delete()
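One caveat that goes beyond the answer above: a huge pk__in list can exceed database parameter limits. An alternative sketch is to collect only the duplicated ids with pandas and delete them in chunks (chunk size is arbitrary):
# rows marked True by duplicated() are the extra copies, not the first one
dupe_ids = df.loc[df.duplicated(subset=["skuid", "review", "date_time"]), "id"].tolist()
for i in range(0, len(dupe_ids), 1000):
    Model.objects.filter(pk__in=dupe_ids[i:i + 1000]).delete()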

Migrate Django model to unique_together constraint

I have a model with three fields
class MyModel(models.Model):
    a = models.ForeignKey(A)
    b = models.ForeignKey(B)
    c = models.ForeignKey(C)
I want to enforce a unique constraint between these fields, and found Django's unique_together, which seems to be the solution. However, I already have an existing database, and there are many duplicates. I know that since unique_together works at the database level, I need to unique-ify the rows, and then run a migration.
Is there a good way to remove the duplicates (where a duplicate has the same (a, b, c)) so that I can run the migration to get the unique_together constraint?
If you are happy to choose one of the duplicates arbitrarily, I think the following might do the trick. Perhaps not the most efficient but simple enough and I guess you only need to run this once. Please verify this all works yourself on some test data in case I've done something silly, since you are about to delete a bunch of data.
First we find groups of objects which form duplicates. For each group, we (arbitrarily) pick a "master" that we are going to keep. Our chosen method is to pick the one with the lowest pk:
from django.db.models import Min, Count
master_pks = (
    MyModel.objects.values('a', 'b', 'c')
    .annotate(Min('pk'), count=Count('pk'))
    .filter(count__gt=1)
    .values_list('pk__min', flat=True)
)
We then loop over each master and delete all of its duplicates:
masters = MyModel.objects.in_bulk(list(master_pks))
for master in masters.values():
    MyModel.objects.filter(
        a=master.a, b=master.b, c=master.c
    ).exclude(pk=master.pk).delete()
I want to add a slightly improved answer that will delete everything in a single query, instead of looping and deleting for each duplicate group. This will be much faster if you have a lot of records.
non_dupe_pks = list(
    Model.objects.values('a', 'b', 'c')
    .annotate(Min('pk'), count=Count('pk'))
    .order_by()
    .values_list('pk__min', flat=True)
)
dupes = Model.objects.exclude(pk__in=non_dupe_pks)
dupes.delete()
It's important to add order_by() in the first query, otherwise the default ordering on the model might mess up the aggregation.
You can comment out the last line and use dupes.count() to check if the query is working as expected.
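Once the duplicates are gone, the constraint itself goes into a migration; a minimal sketch, where the app label, migration name, and field names are assumptions:
from django.db import migrations

class Migration(migrations.Migration):
    # "myapp" and the preceding dedup migration name are placeholders
    dependencies = [("myapp", "0002_remove_duplicates")]

    operations = [
        migrations.AlterUniqueTogether(
            name="mymodel",
            unique_together={("a", "b", "c")},
        ),
    ]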

How to change the default ordering of a query?

I don't want any ordering to be applied to a query, so I have a QuerySet as follows:
question_obj = Question.objects.filter(pk__in=[100,50,27,35,10,42,68]).order_by()
However, when I retrieve the results, they are always ordered by questionID. I iterate over question_obj like this:
for obj in question_obj:
    obj.questionID
The result is displayed as:
10L
27L
35L
42L
50L
68L
100L
If you want to display the objects in the same order as the list of primary keys, then you could use in_bulk to create a dictionary keyed by pk. You can then use a list comprehension to generate the list of questions.
>>> pks = [100,50,27,35,10,42,68]
>>> questions_dict = Question.objects.in_bulk(pks)
>>> questions = [questions_dict[pk] for pk in pks]
>>> for question in questions:
print question.pk
100
50
27
35
...
If you want an unordered collection, use Python's Set object, documented here: http://docs.python.org/tutorial/datastructures.html#sets
If you want the ordering to be the same as the list you're passing as the value for pk__in, you could try:
ids = [100,50,27,35,10,42,68]
questions = list(Question.objects.filter(pk__in=ids))
question_obj = sorted(questions, key=lambda x: ids.index(x.id))
EDIT: And because it's unclear what you mean by "unordered" in reference to a data structure that is by definition ordered: random ordering can be accomplished with the following:
.order_by('?')
Luke, use the Source, er, the Docs! Yeah, that's it!
Django QuerySet API - order_by()
You could do some raw SQL (with FIELD()), à la:
Ordering by the order of values in a SQL IN() clause
which should allow you to retrieve them in the order suggested in the list.
To run custom SQL with the ORM:
https://docs.djangoproject.com/en/dev/topics/db/sql/#executing-custom-sql-directly
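On Django 1.8+ the same FIELD()-style ordering can be expressed in the ORM with Case/When; a sketch:
from django.db.models import Case, When

ids = [100, 50, 27, 35, 10, 42, 68]
# builds CASE WHEN pk=100 THEN 0 WHEN pk=50 THEN 1 ... and orders by it
preserved = Case(*[When(pk=pk, then=pos) for pos, pk in enumerate(ids)])
questions = Question.objects.filter(pk__in=ids).order_by(preserved)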

Using .extra() on fields created by .annotate() in Django

I want to retrieve a sum of two fields (which are aggregations themselves) for each object in a table.
The following may describe a bit better what I'm after, but results in an "Unknown column in field list" error:
items = MyModel.objects.annotate(
    field1=Sum("relatedModel__someField"),
    field2=Sum("relatedModel__someField"),
).extra(
    select={"sum_field1_field2": "field1 + field2"}
)
I also tried using F() for the field lookups but that gives me an invalid sql statement.
Any ideas on how to solve this are much appreciated.
Is this what you want?
items = MyModel.objects.extra(
    select={'sum_field1_field2': 'SUM(relatedModel__someField) + SUM(relatedModel__someField)'},
)
To make it work for many-to-many or many-to-one (reverse) relations, you may use the following:
items = MyModel.objects.extra(
    select={'sum_field1_field2': 'SUM("relatedModel"."someField") + SUM("relatedModel"."someField")'},
)
But this will also break if you need another annotation (such as a count), because extra() adds the statement to the GROUP BY clause, where aggregate functions are not allowed.
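For what it's worth, on Django 1.8+ the sum of two annotations can be expressed without extra() at all; a sketch, with the relation and field names taken from the question:
from django.db.models import F, Sum

items = MyModel.objects.annotate(
    field1=Sum("relatedModel__someField"),
    field2=Sum("relatedModel__someField"),
).annotate(sum_field1_field2=F("field1") + F("field2"))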