Remove duplicates in Django ORM -- multiple rows

I have a model that has four fields. How do I remove duplicate objects from my database?
Daniel Roseman's answer to this question seems appropriate, but I'm not sure how to extend it to a situation where there are four fields to compare per object.
Thanks,
W.

from django.db import models

def remove_duplicated_records(model, fields):
    """
    Removes records from `model` duplicated on `fields`
    while leaving the most recent one (biggest `id`).
    """
    duplicates = model.objects.values(*fields)

    # override any model specific ordering (for `.annotate()`)
    duplicates = duplicates.order_by()

    # group by same values of `fields`; count how many rows are the same
    duplicates = duplicates.annotate(
        max_id=models.Max("id"), count_id=models.Count("id")
    )

    # leave out only the ones which are actually duplicated
    duplicates = duplicates.filter(count_id__gt=1)

    for duplicate in duplicates:
        to_delete = model.objects.filter(**{x: duplicate[x] for x in fields})

        # leave out the latest duplicated record
        # you can use `Min` if you wish to leave out the first record
        to_delete = to_delete.exclude(id=duplicate["max_id"])
        to_delete.delete()
You shouldn't do this often. Use a unique_together constraint on the database instead.
This leaves the record with the biggest id in the DB. If you want to keep the original record (the first one), use models.Min instead. You can also use a completely different field, like a creation date or something.
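For reference, a minimal sketch of such a constraint; the model and field names here are placeholders for your four fields:
from django.db import models

class MyModel(models.Model):
    field_1 = models.CharField(max_length=50)
    field_2 = models.CharField(max_length=50)
    field_3 = models.CharField(max_length=50)
    field_4 = models.CharField(max_length=50)

    class Meta:
        # the database itself will reject any new row duplicating these four fields
        unique_together = (("field_1", "field_2", "field_3", "field_4"),)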
Underlying SQL
When annotating, the Django ORM uses a GROUP BY statement on all model fields used in the query; hence the use of the .values() method. GROUP BY groups all records having identical values in those fields. The duplicated ones (more than one id for the same set of unique fields) are then filtered out in the HAVING statement generated by .filter() on the annotated QuerySet.
SELECT
    field_1,
    …
    field_n,
    MAX(id) AS max_id,
    COUNT(id) AS count_id
FROM
    app_mymodel
GROUP BY
    field_1,
    …
    field_n
HAVING
    count_id > 1
The duplicated records are then deleted in the for loop, except for the most recent one (the one with the biggest id) in each group.
Empty .order_by()
Just to be sure, it's always wise to add an empty .order_by() call before aggregating a QuerySet.
The fields used for ordering the QuerySet are also included in the GROUP BY statement. An empty .order_by() overrides the columns declared in the model's Meta, so they are not included in the SQL query (a default sort by date, for example, can ruin the results).
You might not need to override it at the moment, but someone might add a default ordering later and thereby break your precious delete-duplicates code without even knowing it. Yes, I'm sure you have 100% test coverage…
Just add an empty .order_by() to be safe. ;-)
https://docs.djangoproject.com/en/3.2/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
Transaction
Of course you should consider doing it all in a single transaction.
https://docs.djangoproject.com/en/3.2/topics/db/transactions/#django.db.transaction.atomic
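For example, a minimal sketch, assuming the remove_duplicated_records function above and a hypothetical MyModel with four fields to compare:
from django.db import transaction

with transaction.atomic():
    # either all duplicate rows are deleted, or none are
    remove_duplicated_records(MyModel, ["field_1", "field_2", "field_3", "field_4"])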

If you want to delete duplicates on single or multiple columns, you don't need to iterate over millions of records.
Fetch the relevant columns (don't forget to include the primary key column):
fetch = Model.objects.all().values("id", "skuid", "review", "date_time")
Read the result into pandas (I used pandas instead of an ORM query):
import pandas as pd
df = pd.DataFrame(list(fetch))
Drop duplicates on the columns that define uniqueness:
uniq_df = df.drop_duplicates(subset=["skuid", "review", "date_time"])
# don't include the primary key in the subset, or every row will be treated as unique
Now you'll have the unique records, from which you can pick the primary keys:
primary_keys = uniq_df["id"].tolist()
Finally, it's showtime: exclude those ids from the records and delete the rest of the data.
records = Model.objects.all().exclude(pk__in=primary_keys).delete()

Related

Django inner join with search terms for tables across many-to-many relationships

I have a dynamic hierarchical advanced search interface I created. Basically, it allows you to search terms, from about 6 or 7 tables that are all linked together, that can be and'ed or or'ed together in any combination. The search entries in the form are all compiled into a complex Q expression.
I discovered a problem today. If I provide a search term for a field in a many-to-many related sub-table, the output table can include results from that table that don't match the term.
My problem can be reproduced in the shell with a simple query:
qs = PeakGroup.objects.filter(msrun__sample__animal__studies__id__exact=3)
sqs = qs[0].msrun.sample.animal.studies.all()
sqs.count()
#result: 2
Specifically:
In [3]: qs = PeakGroup.objects.filter(msrun__sample__animal__studies__id__exact=3)
In [12]: ss = qs[0].msrun.sample.animal.studies.all()
In [13]: ss[0].__dict__
Out[13]:
{'_state': <django.db.models.base.ModelState at 0x7fc12bfbf940>,
'id': 3,
'name': 'Small OBOB',
'description': ''}
In [14]: ss[1].__dict__
Out[14]:
{'_state': <django.db.models.base.ModelState at 0x7fc12bea81f0>,
'id': 1,
'name': 'obob_fasted',
'description': ''}
The ids in the sqs queryset include 1 and 3 even though I only searched on 3. I don't get back literally all studies, so it is filtering out some non-matching study records. I understand why I see that, but I don't know how to execute a query that behaves like a join I could perform in SQL, where I can restrict the results to only include records that match the query, instead of getting back only records in the root model plus everything left-joined to those root model records.
Is there a way to do such an inner join (as the result of a single complex Q expression in a filter) on the entire set of linked tables so that I only get back records that match the M:M field search term?
UPDATE:
By looking at the SQL:
In [3]: str(qs.query)
Out[3]: 'SELECT "DataRepo_peakgroup"."id", "DataRepo_peakgroup"."name", "DataRepo_peakgroup"."formula", "DataRepo_peakgroup"."msrun_id", "DataRepo_peakgroup"."peak_group_set_id" FROM "DataRepo_peakgroup" INNER JOIN "DataRepo_msrun" ON ("DataRepo_peakgroup"."msrun_id" = "DataRepo_msrun"."id") INNER JOIN "DataRepo_sample" ON ("DataRepo_msrun"."sample_id" = "DataRepo_sample"."id") INNER JOIN "DataRepo_animal" ON ("DataRepo_sample"."animal_id" = "DataRepo_animal"."id") INNER JOIN "DataRepo_animal_studies" ON ("DataRepo_animal"."id" = "DataRepo_animal_studies"."animal_id") WHERE "DataRepo_animal_studies"."study_id" = 3 ORDER BY "DataRepo_peakgroup"."name" ASC'
...I can see that the query is as specific as I would like it to be, but in the template, how do I specify that I want what I would have seen in the SQL result, had I supplied all of the specific related table fields I wanted to see in the output? E.g.:
SELECT "DataRepo_peakgroup"."id", "DataRepo_peakgroup"."name", "DataRepo_peakgroup"."formula", "DataRepo_peakgroup"."msrun_id", "DataRepo_peakgroup"."peak_group_set_id", "DataRepo_animal_studies"."study_id" FROM "DataRepo_peakgroup" INNER JOIN "DataRepo_msrun" ON ("DataRepo_peakgroup"."msrun_id" = "DataRepo_msrun"."id") INNER JOIN "DataRepo_sample" ON ("DataRepo_msrun"."sample_id" = "DataRepo_sample"."id") INNER JOIN "DataRepo_animal" ON ("DataRepo_sample"."animal_id" = "DataRepo_animal"."id") INNER JOIN "DataRepo_animal_studies" ON ("DataRepo_animal"."id" = "DataRepo_animal_studies"."animal_id") WHERE "DataRepo_animal_studies"."study_id" = 3 ORDER BY "DataRepo_peakgroup"."name" ASC
All the Django ORM gives you back from a query (whose filters use fields from many-to-many [henceforth "M:M"] related tables) is a set of records from the "root" table from which the query started. It uses the same join logic you would use in an SQL query, obtaining records via the foreign keys, so you are guaranteed to get back root table records that DO link to an M:M related table record matching your search term. But when you send the root table records/queryset to be rendered in a template and insert a nested loop to access records from the M:M related table, it always retrieves everything linked to that root table record, whether it matches your search term or not. So if you get back a root table record that links to multiple records in an M:M related table, at least 1 of those records is guaranteed to match your search term, but the others may not.
In order to get the inner join (i.e. only including combined records that match search terms in the table(s) queried), you have to roll your own, because Django doesn't support it. I accomplished this in the following way:
When rendering search results, wherever you want to include records from an M:M related table, you have to create a nested loop. At the inner-most loop, I essentially re-enforce that the record matches all of the search terms. That is, I implement my own filter using conditionals on each combination of records (from the root table and from each of the M:M related tables). If any record combination does not match, I skip it.
Regarding the particulars as to how I did that, I won't get too into the weeds, but at each nested loop, I maintain a dict of its key path (e.g. msrun__sample__animal__studies) as the key and the current record as the value. I then use a custom simple_tag to re-run the search terms against the current record from each table (by checking the key path of the search term against those available in the dict).
Alternatively, you could do this in the view by compiling all of the matching records and sending the combined large table to the template - but I opted not to do this because of the hurdles surrounding cached_properties and since I already pass the search term form data to re-display the executed search form, the structure of the search was already available.
Watch out #1 - Even without "refiltering" the combined records in the template, note that the record count can be inaccurate when dealing with a combined/composite/joined table. To ensure that the number of records I report in my header above the table was always correct/accurate (i.e. represented the number of rows in my html table), I kept track of the filtered count and used javascript under the table to update the record count reported at the top.
Watch out #2 - There is one other thing to keep in mind when querying using terms from M:M related tables. For every M:M related table record that matches the search term, you will get back a duplicate record from the root table, and there's no way to tell the difference between them (i.e. which M:M record/field value matched in each case). For example, if you matched 2 records from an M:M related table, you would get back 2 identical root table records, and when you pass that data to the template, you would end up with 4 rows in your results, each with the same data from the root table record. To avoid this, all you have to do is append the .distinct() filter to your results queryset.
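For example, on the query from the question, a sketch would be:
# same lookup as before, with duplicate root-table rows collapsed
qs = PeakGroup.objects.filter(
    msrun__sample__animal__studies__id__exact=3
).distinct()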

How to find duplicate and deactivate duplicates for user attributes

Suppose we have a model in django defined as follows:
class DateClass(models.Model):
    user_id = models.IntegerField(...)
    sp_date = models.DateField(...)
    is_active = models.BooleanField(...)
    ...
I follow an insert-only policy here, i.e. for a specific user there will be only one active row per specific date. That means there will be only one active row for user=1 in the date table for sp_date values 27/10/2021, 28/10/2021, and so on. There shouldn't be two active rows for 27/10/2021 for user=1, but other users can have their own rows for 27/10/2021. Whenever a date has to be updated, I deactivate (is_active=False) the previous row and add a new row for the specific date.
I want to find duplicate active dates for each user in one single query, and then deactivate (set is_active=False) all the duplicate values except the last row (the row which was inserted last). Two rows are duplicates if their user_id and sp_date values are equal and both have is_active=True. I know how to find duplicates on a specific column, which is fairly easy, but I can't think of something that does the above task elegantly. I can only think of the following approach:
for user in users:
    dates = DateClass.objects.filter(user_id=user.id, is_active=True)
    for date in dates:
        days = dates.filter(sp_date=date.sp_date, is_active=True)
        if days.count() > 1:
            last_day = days.last()
            days.exclude(id=last_day.id).update(is_active=False)
As you can see, the above is not that efficient, as I have to loop through all users. Is there any way to do this more efficiently? I am using PostgreSQL for the database.
There's a great answer for finding duplicates on multiple fields in this answer. I don't want to take the credit or reinvent the wheel, so I'll build on that answer.
For your case it should be:
from django.db.models import Count

duplicate_date_class = (
    DateClass.objects.values('user_id', 'sp_date')
    .order_by()
    .annotate(records=Count('user_id'))
    .filter(records__gt=1)
)

# Then do operations on duplicates
for date_class in duplicate_date_class:
    # .update() can't be called on a sliced queryset, so collect the pks first;
    # ordering by -id keeps the most recently inserted row active
    pks_to_deactivate = list(
        DateClass.objects.filter(
            user_id=date_class['user_id'],
            sp_date=date_class['sp_date'],
        ).order_by('-id').values_list('pk', flat=True)[1:]
    )
    DateClass.objects.filter(pk__in=pks_to_deactivate).update(is_active=False)
If you want to avoid having duplicates on a set of multiple fields, I suggest taking a look at unique_together for model validation.
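Since you are on PostgreSQL, it is also worth noting that Django 2.2+ can enforce the "only one active row per user and date" rule at the database level with a conditional UniqueConstraint. A minimal sketch (the constraint name is arbitrary):
from django.db import models
from django.db.models import Q

class DateClass(models.Model):
    user_id = models.IntegerField()
    sp_date = models.DateField()
    is_active = models.BooleanField(default=True)

    class Meta:
        constraints = [
            # partial unique index: only rows with is_active=True must be unique
            models.UniqueConstraint(
                fields=["user_id", "sp_date"],
                condition=Q(is_active=True),
                name="unique_active_user_sp_date",
            )
        ]
With this in place, inserting a second active row for the same user and date raises an IntegrityError instead of silently creating a duplicate.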

How do I use django's Q with django taggit?

I have a Result object that is tagged with "one" and "two". When I try to query for objects tagged "one" and "two", I get nothing back:
q = Result.objects.filter(Q(tags__name="one") & Q(tags__name="two"))
print(len(q))
# prints zero, was expecting 1
Why does it not work with Q? How can I make it work?
The way django-taggit implements tagging is essentially through a ManyToMany relationship. In such cases there is a separate table in the database that holds these relations. It is usually called a "through" or intermediate model, as it connects the two models. In the case of django-taggit this is called TaggedItem. So you have the Result model, which is your model, and the two models Tag and TaggedItem provided by django-taggit.
When you make a query such as Result.objects.filter(Q(tags__name="one")), it translates to looking up rows in the Result table that have a corresponding row in the TaggedItem table that has a corresponding row in the Tag table with name="one".
Trying to match two tag names would translate to looking up rows in the Result table that have a corresponding row in the TaggedItem table that has a corresponding row in the Tag table with both name="one" AND name="two". You obviously never have that, as there is only one value per row; it's either "one" or "two".
These details are hidden away from you in the django-taggit implementation, but this is what happens whenever you have a ManyToMany relationship between objects.
To resolve this you can:
Option 1
Query tag after tag, evaluating the results each time, as suggested in the other answers. This might be okay for two tags, but will not be good when you need to look for objects that have 10 tags set on them. Here is one way to do this that results in two queries and gets you the result:
# get the IDs of the Result objects tagged with "one"
query_1 = Result.objects.filter(tags__name="one").values('id')
# use this in a second query to filter the ID and look for the second tag.
results = Result.objects.filter(pk__in=query_1, tags__name="two")
You could achieve this with a single query so you only have one trip from the app to the database, which would look like this:
from django.db.models import Exists, OuterRef

# create a django subquery - this is not evaluated, but used to construct the final query
subquery = Result.objects.filter(pk=OuterRef('pk'), tags__name="one").values('id')
# perform a combined query using a subquery against the database
results = Result.objects.filter(Exists(subquery), tags__name="two")
This would only make one trip to the database. (Note: filtering on sub-queries requires django 3.0).
But you are still limited to two tags. If you need to check for 10 tags or more, the above is not really workable...
Option 2
Query the relationship table directly instead, and aggregate the results in a way that gives you the object IDs.
from django.contrib.contenttypes.models import ContentType
from django.db.models import Count
from taggit.models import TaggedItem

# django-taggit uses Content Types so we need to pick up the content type from cache
result_content_type = ContentType.objects.get_for_model(Result)
tag_names = ["one", "two"]
tagged_results = (
    TaggedItem.objects.filter(tag__name__in=tag_names, content_type=result_content_type)
    .values('object_id')
    .annotate(occurrence=Count('object_id'))
    .filter(occurrence=len(tag_names))
    .values_list('object_id', flat=True)
)
TaggedItem is the hidden table in the django-taggit implementation that contains the relationships. The above will query that table, aggregate all the rows that refer to either the "one" or the "two" tag, group the results by the ID of the objects, and then pick those where the object ID had the number of tags you are looking for.
This is a single query that, at the end, gets you the IDs of all the objects that have been tagged with both tags. It is also the exact same query regardless of whether you need 2 tags or 200.
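From there, if you need the Result objects themselves rather than their IDs, one more lookup does it:
results = Result.objects.filter(pk__in=list(tagged_results))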
Please review this and let me know if anything needs clarification.
First of all, the first two of these are the same query (and the first is not even valid Python, since a keyword argument can't be repeated):
Result.objects.filter(tags__name="one", tags__name="two")
Result.objects.filter(Q(tags__name="one") & Q(tags__name="two"))
The chained version is different, because each .filter() call gets its own join:
Result.objects.filter(tags__name__in=["one"]).filter(tags__name__in=["two"])
I think the name field is a CharField, and no single record can be equal to "one" and "two" at the same time.
In Python code, the failing query looks like this (always False, which is why you are getting no result):
from random import choice

name = choice(["abtin", "shino"])
if name == "abtin" and name == "shino":
    ...  # never reached: a single value can't equal both names
We use Q objects to implement OR and other complex queries.
In the example that works, you do an AND on two Python objects (querysets). That gets applied across records, not necessarily to the same record that has both "one" AND "two" as tags.
P.S. Why do you use the in filter?
q = Result.objects.filter(tags__name__in=["one"]).filter(tags__name__in=["two"])
Add .distinct() to remove duplicates if you expect more than one unique object.
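Putting those last two points together, a minimal sketch of the chained-filter approach (each .filter() call gets its own join, and .distinct() collapses the duplicate rows):
q = (
    Result.objects
    .filter(tags__name__in=["one"])
    .filter(tags__name__in=["two"])
    .distinct()
)
print(len(q))  # expecting 1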

Django Unique Bulk Inserts

I need to be able to bulk insert large numbers of records quickly, while still ensuring uniqueness in the database. The new records to be inserted have already been parsed, and are unique. I'm hoping there is a way to enforce uniqueness at the database level, and not in the code itself.
I'm using MySQL as the database backend. If django supports this functionality in any other database, I am flexible in changing the backend, as this is a requirement.
Bulk inserts in Django don't use the save method, so how can I insert several hundred to several thousand records at a time, while still respecting unique fields and unique together fields?
My model structures, simplified, look something like this:
class Example(models.Model):
    class Meta:
        unique_together = (('name', 'number'),)

    name = models.CharField(max_length=50)
    number = models.CharField(max_length=10)
    ...
    fk = models.ForeignKey(OtherModel)
Edit:
The records that aren't already in the database should be inserted, and the records that already existed should be ignored.
As miki725 mentioned, you don't have a problem with your current code.
I'm assuming you are using the bulk_create method. It is true that the save() method is not called when using bulk_create, but the uniqueness of fields is not enforced inside the save() method anyway. When you use unique_together, a unique constraint is added to the underlying table in MySQL when creating the table:
Django:
unique_together = (('name', 'number'),)
MySQL:
UNIQUE KEY `name` (`name`,`number`)
So if you insert a value into the table using any method (save, bulk_insert or even raw sql) you will get this exception from mysql:
Duplicate entry 'value1-value2' for key 'name'
UPDATE:
What bulk_create does is create one big query that inserts all the data at once. So if one of the entries is a duplicate, it throws an exception and none of the data is inserted.
1- One option is to use the batch_size parameter of bulk_create and make it insert the data in a number of batches, so that if one of them fails you only lose the rest of the data in that batch. (It depends on how important it is to insert all the data and how frequent the duplicate entries are.)
2- Another option is to write a for loop over the bulk data and insert the records one by one, as sketched below. This way the exception is thrown for that one row only and the rest of the data is inserted. This will query the db every time and is of course a lot slower.
3- A third option is to lift the unique constraint, insert the data using bulk_create, and then write a simple query that deletes the duplicate rows.
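A sketch of option 2, assuming new_records is a list of unsaved Example instances:
from django.db import IntegrityError

for record in new_records:
    try:
        record.save()
    except IntegrityError:
        pass  # already in the database; skip it
For what it's worth, on Django 2.2 and newer, bulk_create also accepts an ignore_conflicts flag that does exactly what the edit asks for (insert the new records, silently skip the ones that already exist); on MySQL it is executed as INSERT IGNORE:
# requires Django 2.2+
Example.objects.bulk_create(new_records, ignore_conflicts=True)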
Django itself does not enforce the unique_together meta attribute. This is enforced by the database using the UNIQUE clause. You can insert as much data as you want and you are guaranteed that the specified fields will be unique. If not, an IntegrityError will be raised. More about unique_together in the docs.

Migrate Django model to unique_together constraint

I have a model with three fields
class MyModel(models.Model):
    a = models.ForeignKey(A)
    b = models.ForeignKey(B)
    c = models.ForeignKey(C)
I want to enforce a unique constraint between these fields, and found django's unique_together, which seems to be the solution. However, I already have an existing database, and there are many duplicates. I know that since unique_together works at the database level, I need to unique-ify the rows, and then try a migration.
Is there a good way to go about removing duplicates (where a duplicate has the same (a, b, c)) so that I can run the migration to get the unique_together constraint?
If you are happy to choose one of the duplicates arbitrarily, I think the following might do the trick. Perhaps not the most efficient, but it's simple enough, and I guess you only need to run this once. Please verify it all works yourself on some test data, in case I've done something silly, since you are about to delete a bunch of data.
First we find groups of objects which form duplicates. For each group, we (arbitrarily) pick a "master" that we are going to keep. Our chosen method is to pick the one with the lowest pk:
from django.db.models import Min, Count

master_pks = (
    MyModel.objects.values('a', 'b', 'c')
    .annotate(Min('pk'), count=Count('pk'))
    .filter(count__gt=1)
    .values_list('pk__min', flat=True)
)
We then loop over each master and delete all of its duplicates:
masters = MyModel.objects.in_bulk(list(master_pks))
for master in masters.values():
    MyModel.objects.filter(
        a=master.a, b=master.b, c=master.c
    ).exclude(pk=master.pk).delete()
I want to add a slightly improved answer that will delete everything in a single query, instead of looping and deleting for each duplicate group. This will be much faster if you have a lot of records.
non_dupe_pks = list(
    MyModel.objects.values('a', 'b', 'c')
    .annotate(Min('pk'), count=Count('pk'))
    .order_by()
    .values_list('pk__min', flat=True)
)
dupes = MyModel.objects.exclude(pk__in=non_dupe_pks)
dupes.delete()
It's important to add order_by() in the first query, otherwise the default ordering on the model might interfere with the aggregation.
You can comment out the last line and use dupes.count() to check that the query is working as expected.
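Once the duplicates are gone, the migration itself is straightforward. A minimal sketch, assuming the MyModel above: add unique_together to Meta, then generate and apply the migration:
class MyModel(models.Model):
    a = models.ForeignKey(A)
    b = models.ForeignKey(B)
    c = models.ForeignKey(C)

    class Meta:
        unique_together = (('a', 'b', 'c'),)

python manage.py makemigrations
python manage.py migrate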