How to find and deactivate duplicates for user attributes - django

Suppose we have a model in Django defined as follows:
class DateClass(models.Model):
    user_id = models.IntegerField(...)
    sp_date = models.DateField(...)
    is_active = models.BooleanField(...)
    ...
I follow an insert-only policy here, i.e., for a specific user there will be only one active row per date. That means there will be only one active row for user=1 in the date table for each of the sp_date values 27/10/2021, 28/10/2021 and so on. There shouldn't be two active rows for 27/10/2021 for user=1, but other users can have their own rows for 27/10/2021. Whenever a date has to be updated, I deactivate (is_active=False) the previous row and add a new row for that date.
I want to find duplicate active dates for each user in one single query, and then deactivate (set is_active=False) all the duplicates except the last row (the row that was inserted last). Two rows are duplicates if their user_id and sp_date values are equal and both have is_active=True. I know how to find duplicates for a single column, which is fairly easy, but I can't think of an elegant way to do the above. I can only think of the following approach:
for user in users:
    # Active rows for this user
    dates = DateClass.objects.filter(user_id=user.id, is_active=True)
    for date in dates:
        days = dates.filter(
            sp_date=date.sp_date, is_active=True
        )
        if days.count() > 1:
            # Keep the last row, deactivate the rest
            last_day = days.last()
            days.exclude(id=last_day.id).update(is_active=False)
As you can see, the above is not very efficient, as I have to loop through all users. Is there any way to do this more efficiently? I am using PostgreSQL as the database.

There is a great existing answer on querying for duplicates across multiple fields. I don't want to take the credit or reinvent the wheel, so I suggest reading that answer.
For your case it should be:
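from django.db.models import Count, Max

# Group the active rows by (user_id, sp_date) and keep only duplicated groups
duplicate_date_class = (
    DateClass.objects.filter(is_active=True)
    .values('user_id', 'sp_date')
    .order_by()
    .annotate(records=Count('id'), last_id=Max('id'))
    .filter(records__gt=1)
)
# Deactivate every duplicate except the row with the highest id
# (assumes the highest id is the row that was inserted last)
for date_class in duplicate_date_class:
    DateClass.objects.filter(
        user_id=date_class['user_id'],
        sp_date=date_class['sp_date'],
        is_active=True,
    ).exclude(id=date_class['last_id']).update(is_active=False)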
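If you want to prevent duplicates across a set of fields in the first place, I suggest taking a look at unique_together, which enforces uniqueness at the database level. Note that here only the active rows must be unique, so a plain unique_together on (user_id, sp_date) would also reject the deactivated rows; a UniqueConstraint with a condition on is_active is closer to what you need.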
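Since the question asks for a single query, here is a minimal sketch of the same idea as one UPDATE statement. It uses Python's built-in sqlite3 module so that it runs on its own; the table name date_class and the sample rows are assumptions made for illustration, and the statement should carry over to PostgreSQL with little change. It also assumes that the highest id in a group is the row that was inserted last.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table mirroring the DateClass model
conn.execute(
    "CREATE TABLE date_class ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, sp_date TEXT, is_active INTEGER)"
)
rows = [
    (1, 1, "2021-10-27", 1),
    (2, 1, "2021-10-27", 1),  # duplicate of row 1, inserted later
    (3, 1, "2021-10-28", 1),
    (4, 2, "2021-10-27", 1),  # other user, same date: not a duplicate
    (5, 1, "2021-10-27", 0),  # already inactive, ignored by the query
]
conn.executemany("INSERT INTO date_class VALUES (?, ?, ?, ?)", rows)

# Keep the highest id per (user_id, sp_date) among the active rows and
# deactivate every other active row in one statement.
conn.execute(
    """
    UPDATE date_class
    SET is_active = 0
    WHERE is_active = 1
      AND id NOT IN (
          SELECT MAX(id) FROM date_class
          WHERE is_active = 1
          GROUP BY user_id, sp_date
      )
    """
)
active_ids = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM date_class WHERE is_active = 1 ORDER BY id"
    )
]
print(active_ids)
```

In Django the same statement could be run through a raw cursor, or approximated with a subquery in the ORM; either way it is worth trying on a copy of the data first.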

Related

Join two records from same model in django queryset

Been searching the web for a couple hours now looking for a solution but nothing quite fits what I am looking for.
I have one model (simplified):
class SimpleModel(Model):
    name = CharField('Name', unique=True)
    date = DateField()
    amount = FloatField()
I have two dates: date_one and date_two.
I would like a single queryset with a row for each name in the Model, with each row showing:
{'name': name, 'date_one': date_one, 'date_two': date_two, 'amount_one': amount_one, 'amount_two': amount_two, 'change': amount_two - amount_one}
Reason being I would like to be able to find the rank of amount_one, amount_two, and change, using sort or filters on that single queryset.
I know I could create a list of dictionaries from two separate querysets, then sort that and get the ranks from the index values, but perhaps naively I feel like there should be a DB solution using one queryset that would be faster.
union seemed promising, but you cannot perform some simple operations like filter after it.
I think I could perhaps split name into its own Model and generate a queryset with related fields, but I'd prefer not to change the schema at this stage. Also, I only have access to sqlite.
Appreciate any help!
Your current model forces you to have ONE name associated with ONE date and ONE amount. Because name is unique=True, you cannot have two dates associated with the same name.
So if you want to be able to have several dates/amounts associated with a name, there are several ways to proceed.
Idea 1: If there will only ever be 2 dates and 2 amounts, simply add a second date field and a second amount field.
Idea 2: If there can be any number of dates and amounts, you'll have to change your model to reflect it, by having:
A model for your names
A model for your days and amounts, with a foreign key to your names
Idea 3: You could keep the same model and simply remove the unique constraint, but that's a recipe for mistakes.
Based on your choice, you'll then have several ways of querying what you need, depending on your final model structure. The best way to go would be to create custom model methods that query the 2 dates/amounts, format an array and return it.
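As a rough illustration of what such a query could look like once a name can appear once per date (Idea 2 or Idea 3), the sketch below pivots the two dates into one row per name with conditional aggregation. It uses the built-in sqlite3 module so it runs standalone; the table name simple_model, the sample dates and the amounts are assumptions, not part of the original question.

```python
import sqlite3
from datetime import date

# Hypothetical dates standing in for date_one and date_two
date_one = date(2021, 1, 1).isoformat()
date_two = date(2021, 2, 1).isoformat()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE simple_model (name TEXT, date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO simple_model VALUES (?, ?, ?)",
    [
        ("a", date_one, 10.0), ("a", date_two, 15.0),
        ("b", date_one, 20.0), ("b", date_two, 12.0),
    ],
)

# One row per name; each CASE picks the amount for one of the two dates
query = """
    SELECT name,
           MAX(CASE WHEN date = :d1 THEN amount END) AS amount_one,
           MAX(CASE WHEN date = :d2 THEN amount END) AS amount_two
    FROM simple_model
    WHERE date IN (:d1, :d2)
    GROUP BY name
"""
# Assumes every name has a row for both dates; otherwise an amount is None
rows = [
    {"name": n, "amount_one": a1, "amount_two": a2, "change": a2 - a1}
    for n, a1, a2 in conn.execute(query, {"d1": date_one, "d2": date_two})
]
# Rank by change, largest first
rows.sort(key=lambda r: r["change"], reverse=True)
print(rows)
```

The ranks for amount_one and amount_two can be obtained the same way by sorting on a different key.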

django setting filter field with a variable

I show a model of sales that can be aggregated by different fields through a form. Products, clients, categories, etc.
view_by_choice = filter_opts.cleaned_data["view_by_choice"]
sales = sales.values(view_by_choice).annotate(........).order_by(......)
In the same form I have a string input where the user can filter the results. By "product code" for example.
input_code = filter_opts.cleaned_data["filter_code"]
sales = sales.filter(prod_code__icontains=input_code)
What I want to do is filter the queryset "sales" by the input_code, defining the field dynamically from the view_by_choice variable.
Something like:
sales = sales.filter(VARIABLE__icontains=input_code)
Is it possible to do this? Thanks in advance.
You can make use of dictionary unpacking [PEP-448] here:
sales = sales.filter(
    **{'{}__icontains'.format(view_by_choice): input_code}
)
Given that view_by_choice contains, for example, 'foo', we first make a dictionary { 'foo__icontains': input_code }, and then unpack it as a named parameter with the two consecutive asterisks (**).
That being said, I strongly advise you to do some validation on view_by_choice: ensure that the number of valid options is limited. Otherwise a user might inject malicious field names, lookups, etc. to extract data from your database that should remain hidden.
For example, if your model has a ForeignKey named owner to the User model, a user could pass owner__email and start finding out which emails are in the database by generating a large number of queries and looking at what each query returns.
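One way to do that validation is to check the field name against a fixed whitelist before building the lookup. The sketch below is plain Python; the helper name build_icontains_filter and the field names in ALLOWED_VIEW_BY are hypothetical and would need to match your own model.

```python
# Hypothetical set of field names the user is allowed to filter on
ALLOWED_VIEW_BY = {"prod_code", "client_name", "category"}


def build_icontains_filter(view_by_choice, input_code, allowed=ALLOWED_VIEW_BY):
    """Return filter kwargs for an icontains lookup on a whitelisted field."""
    if view_by_choice not in allowed:
        # Rejects anything outside the whitelist, e.g. 'owner__email'
        raise ValueError(f"Unsupported filter field: {view_by_choice!r}")
    return {f"{view_by_choice}__icontains": input_code}


kwargs = build_icontains_filter("prod_code", "AB12")
print(kwargs)
# In the view this would be used as: sales = sales.filter(**kwargs)
```

If view_by_choice comes from a form ChoiceField, the form already restricts it to the declared choices, so an explicit check like this mainly guards code paths that bypass the form.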

Django query aggregate upvotes in backward relation

I have two models:
Base_Activity:
    some fields
User_Activity:
    user = models.ForeignKey(settings.AUTH_USER_MODEL)
    activity = models.ForeignKey(Base_Activity)
    rating = models.IntegerField(default=0)  # Will be -1, 0, or 1
Now I want to query Base_Activity, and sort the items that have the most corresponding user activities with rating=1 on top. I want to do something like the query below, but the =1 part is obviously not working.
activities = Base_Activity.objects.all().annotate(
    up_votes = Count('user_activity__rating'=1),
).order_by(
    'up_votes'
)
How can I solve this?
You cannot use Count like that, as the error message says:
SyntaxError: keyword can't be an expression
The argument of Count must be a simple string, like user_activity__rating.
I think a good alternative can be to use Avg and Count together:
from django.db.models import Avg, Count

activities = Base_Activity.objects.all().annotate(
    a=Avg('user_activity__rating'), c=Count('user_activity__rating')
).order_by(
    '-a', '-c'
)
Items whose ratings are mostly 1 will have the highest average, and among items with the same average, the ones with the most activities will be listed higher. Note that this ranks by average rating rather than by the raw number of upvotes.
If you want to exclude items that have downvotes, make sure to add the appropriate filter or exclude operations after annotate, for example:
activities = Base_Activity.objects.all().annotate(
    a=Avg('user_activity__rating'), c=Count('user_activity__rating')
).filter(user_activity__rating__gt=0).order_by(
    '-a', '-c'
)
UPDATE
To get all the items, ordered by their upvotes, disregarding downvotes, I think the only way is to use raw queries, like this:
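from django.db import connection

# Table and column names below are illustrative; adapt them to your schema
sql = '''
SELECT o.id, SUM(v.rating > 0) s
FROM user_activity o
JOIN rating v ON o.id = v.user_activity_id
GROUP BY o.id ORDER BY s DESC
'''
cursor = connection.cursor()
cursor.execute(sql)
# Fetch from the cursor itself; execute() does not return rows on every backend
rows = cursor.fetchall()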
Note: instead of hard-coding the table names of your models, get them from the models. For example, if your model is called Rating, you can get its table name with Rating._meta.db_table.
I tested this query on an sqlite3 database; I'm not sure the SUM expression there works in every DBMS. By the way, I had a perfect Django site to test on, where I also use upvotes and downvotes. I use a very similar model for counting upvotes and downvotes, but I order them by the sum value, Stack Overflow style. The site is open-source, if you're interested.
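For comparison, here is a sketch of the same counting idea written against the two models from the question, using the built-in sqlite3 module so it can be run directly. The table names base_activity and user_activity, the title column and the sample ratings are assumptions. A CASE expression is used instead of SUM(v.rating > 0) because it should be more portable across databases, and the LEFT JOIN keeps activities that have no ratings at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables mirroring Base_Activity and User_Activity
conn.executescript(
    """
    CREATE TABLE base_activity (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE user_activity (
        id INTEGER PRIMARY KEY,
        activity_id INTEGER REFERENCES base_activity(id),
        rating INTEGER DEFAULT 0
    );
    INSERT INTO base_activity (id, title)
        VALUES (1, 'run'), (2, 'swim'), (3, 'nap'), (4, 'read');
    INSERT INTO user_activity (activity_id, rating) VALUES
        (1, 1), (1, -1), (1, 1),
        (2, 1), (2, 0),
        (3, -1);
    """
)

# Count only rating = 1 rows per activity; downvotes and zeros add nothing
rows = conn.execute(
    """
    SELECT a.id,
           SUM(CASE WHEN u.rating = 1 THEN 1 ELSE 0 END) AS up_votes
    FROM base_activity a
    LEFT JOIN user_activity u ON u.activity_id = a.id
    GROUP BY a.id
    ORDER BY up_votes DESC, a.id
    """
).fetchall()
print(rows)
```

Django 2.0 and later also accept a filter=Q(...) argument on aggregates such as Count, which should let you express the same conditional count in the ORM without raw SQL.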

Remove duplicates in Django ORM -- multiple rows

I have a model that has four fields. How do I remove duplicate objects from my database?
Daniel Roseman's answer to this question seems appropriate, but I'm not sure how to extend this to situation where there are four fields to compare per object.
Thanks,
W.
from django.db import models


def remove_duplicated_records(model, fields):
    """
    Removes records from `model` duplicated on `fields`
    while leaving the most recent one (biggest `id`).
    """
    duplicates = model.objects.values(*fields)
    # override any model specific ordering (for `.annotate()`)
    duplicates = duplicates.order_by()
    # group by same values of `fields`; count how many rows are the same
    duplicates = duplicates.annotate(
        max_id=models.Max("id"), count_id=models.Count("id")
    )
    # keep only the groups which are actually duplicated
    duplicates = duplicates.filter(count_id__gt=1)
    for duplicate in duplicates:
        to_delete = model.objects.filter(**{x: duplicate[x] for x in fields})
        # spare the latest duplicated record
        # you can use `Min` if you wish to spare the first record instead
        to_delete = to_delete.exclude(id=duplicate["max_id"])
        to_delete.delete()
You shouldn't need to do this often. Use unique_together constraints on the database instead.
This leaves the record with the biggest id in the DB. If you want to keep the original record (the first one), modify the code a bit to use models.Min. You can also use a completely different field, such as a creation date.
Underlying SQL
When annotating, the Django ORM uses a GROUP BY clause on all model fields used in the query, hence the use of the .values() method. GROUP BY groups all records that have identical values in those fields. The duplicated groups (more than one id for the given fields) are then selected by the HAVING clause generated by .filter() on the annotated QuerySet.
SELECT
    field_1,
    …
    field_n,
    MAX(id) AS max_id,
    COUNT(id) AS count_id
FROM
    app_mymodel
GROUP BY
    field_1,
    …
    field_n
HAVING
    COUNT(id) > 1
The duplicated records are later deleted in the for loop, with the exception of the most recent one in each group.
Empty .order_by()
Just to be sure, it's always wise to add an empty .order_by() call before aggregating a QuerySet.
The fields used for ordering the QuerySet are also included in GROUP BY statement. Empty .order_by() overrides columns declared in model's Meta and in result they're not included in the SQL query (e.g. default sorting by date can ruin the results).
You might not need to override it at the current moment, but someone might add default ordering later and therefore ruin your precious delete-duplicates code not even knowing that. Yes, I'm sure you have 100% test coverage…
Just add empty .order_by() to be safe. ;-)
https://docs.djangoproject.com/en/3.2/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
Transaction
Of course you should consider doing it all in a single transaction.
https://docs.djangoproject.com/en/3.2/topics/db/transactions/#django.db.transaction.atomic
If you want to delete duplicates on single or multiple columns, you don't need to iterate over millions of records.
Fetch all unique columns (don't forget to include the primary key column)
fetch = Model.objects.all().values("id", "skuid", "review", "date_time")
Read the result using pandas (I used pandas instead of an ORM query)
import pandas as pd
df = pd.DataFrame.from_records(fetch)
Drop duplicates on unique columns
uniq_df = df.drop_duplicates(subset=["skuid", "review", "date_time"])
# Don't include the primary key in subset, or no rows will count as duplicates
Now, you'll get the unique records from where you can pick the primary key
primary_keys = uniq_df["id"].tolist()
Finally, it's show time (exclude those ids from the records and delete the rest of the data)
records = Model.objects.all().exclude(pk__in=primary_keys).delete()
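If pandas is not available, the same "keep the first row of each group" selection can be sketched in plain Python. The helper name first_pk_per_group and the sample rows are hypothetical; rows stands in for the dictionaries that .values() would yield.

```python
def first_pk_per_group(rows, subset, pk="id"):
    """Return the primary key of the first row seen for each `subset` value."""
    keep = {}
    for row in rows:
        key = tuple(row[col] for col in subset)
        # setdefault stores the pk only the first time a key is seen
        keep.setdefault(key, row[pk])
    return list(keep.values())


# Hypothetical rows, shaped like the result of .values(...)
rows = [
    {"id": 1, "skuid": "A", "review": "ok", "date_time": "2021-01-01"},
    {"id": 2, "skuid": "A", "review": "ok", "date_time": "2021-01-01"},
    {"id": 3, "skuid": "B", "review": "bad", "date_time": "2021-01-02"},
]
primary_keys = first_pk_per_group(rows, ["skuid", "review", "date_time"])
print(primary_keys)
```

The resulting list plays the same role as primary_keys above: rows whose pk is not in it are the duplicates to delete.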

Migrate Django model to unique_together constraint

I have a model with three fields
class MyModel(models.Model):
    a = models.ForeignKey(A)
    b = models.ForeignKey(B)
    c = models.ForeignKey(C)
I want to enforce a unique constraint between these fields, and found django's unique_together, which seems to be the solution. However, I already have an existing database, and there are many duplicates. I know that since unique_together works at the database level, I need to unique-ify the rows, and then try a migration.
Is there a good way to go about removing duplicates (where a duplicate has the same (A,B,C)) so that I can run the migration to get the unique_together constraint?
If you are happy to choose one of the duplicates arbitrarily, I think the following might do the trick. Perhaps not the most efficient but simple enough and I guess you only need to run this once. Please verify this all works yourself on some test data in case I've done something silly, since you are about to delete a bunch of data.
First we find groups of objects which form duplicates. For each group, (arbitrarily) pick a "master" that we are going to keep. Our chosen method is to pick the one with the lowest pk:
from django.db.models import Min, Count

# Field names must match the model attributes (a, b, c)
master_pks = (
    MyModel.objects.values('a', 'b', 'c')
    .order_by()
    .annotate(Min('pk'), count=Count('pk'))
    .filter(count__gt=1)
    .values_list('pk__min', flat=True)
)
We then loop over each master and delete all its duplicates:
masters = MyModel.objects.in_bulk(list(master_pks))
for master in masters.values():
    # Destructive: verify on test data before running
    MyModel.objects.filter(
        a=master.a, b=master.b, c=master.c
    ).exclude(pk=master.pk).delete()
I want to add a slightly improved answer that will delete everything in a single query, instead of looping and deleting for each duplicate group. This will be much faster if you have a lot of records.
from django.db.models import Min, Count

# Lowest pk of every (a, b, c) group: these are the rows to keep
non_dupe_pks = list(
    MyModel.objects.values('a', 'b', 'c')
    .annotate(Min('pk'), count=Count('pk'))
    .order_by()
    .values_list('pk__min', flat=True)
)
dupes = MyModel.objects.exclude(pk__in=non_dupe_pks)
dupes.delete()
It's important to add order_by() in the first query, otherwise the default ordering in the model might interfere with the aggregation.
You can comment out the last line and use dupes.count() to check if the query is working as expected.
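To see the whole sequence end to end (count the duplicates, delete them, then add the unique constraint), here is a small standalone sketch using the built-in sqlite3 module. The table name my_model, the index name uniq_abc and the sample rows are assumptions; in a real project the constraint would be added by the Django migration rather than by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table mirroring MyModel, with plain integers for a, b, c
conn.execute(
    "CREATE TABLE my_model (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c INTEGER)"
)
conn.executemany(
    "INSERT INTO my_model (a, b, c) VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 2, 3), (1, 1, 1), (2, 2, 2)],
)

# Rows that are NOT the lowest id of their (a, b, c) group are duplicates
dupes_sql = (
    "FROM my_model WHERE id NOT IN "
    "(SELECT MIN(id) FROM my_model GROUP BY a, b, c)"
)
# Dry run first, like checking dupes.count() before deleting
dupe_count = conn.execute("SELECT COUNT(*) " + dupes_sql).fetchone()[0]
conn.execute("DELETE " + dupes_sql)

# With the duplicates gone, the unique index can be created without errors
conn.execute("CREATE UNIQUE INDEX uniq_abc ON my_model (a, b, c)")
remaining = conn.execute("SELECT id FROM my_model ORDER BY id").fetchall()
print(dupe_count, remaining)
```

Running the count before the delete gives a cheap sanity check that the number of rows about to be removed is what you expect.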