How to delete 200,000 records with Django? [duplicate] - django

This question already has answers here:
How to make Django QuerySet bulk delete() more efficient
(3 answers)
Situation:
I have two models with a one-to-one relation, for example:

class User(models.Model):
    user_name = models.CharField(max_length=40)
    type = models.CharField(max_length=255)
    created_at = models.DateTimeField()
    ...

class Book(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)
And I have around 200,000 records.
Language: Python
Framework: Django
Database: PostgreSQL
Question:
How can I delete the 200,000 records above with minimal cost?
Solution I have tried:

# assumes: from django.db import transaction
batch_size = 1000
# Fetch up to 200,000 matching user ids.
user_ids = list(User.objects.filter(
    type='sample',
    created_at__gte='2022-11-15 08:00',
    created_at__lt='2022-11-15 08:30',
).values_list('id', flat=True)[:200000])
for batch_start in range(0, len(user_ids), batch_size):
    with transaction.atomic():
        _, deleted = User.objects.filter(
            id__in=user_ids[batch_start:batch_start + batch_size],
        ).delete()
With this solution, my server uses around:
CPU: 600MB
RAM: 300MB
and it takes more than 15 minutes to finish the workload.
Does anyone have a better solution?

By first principles, nothing beats raw SQL (as opposed to a Django ORM query) in terms of speed, because it operates closest to the database:

cursor.execute("DELETE FROM some_table WHERE some_column = %s", [value])

Or else you can do it through the ORM:

queryset = Model.objects.filter(field=value)
if queryset.exists():
    queryset.delete()
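For example, a minimal self-contained sketch of the raw-cursor route (the table and column names here are placeholders, not from the question):

from django.db import connection, transaction

with transaction.atomic():
    with connection.cursor() as cursor:
        # Parameters are bound by the driver, never interpolated by hand.
        cursor.execute("DELETE FROM some_table WHERE some_column = %s", ["sample"])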

Thanks, everyone. I have tried the solution with a raw query:

# assumes: from django.db import connection, transaction
# The sliced queryset is lazy, so each loop iteration re-fetches the
# next batch of up to 200,000 ids after the previous delete.
user_ids = User.objects.filter(
    type='sample',
    created_at__gte='2022-11-15 08:00',
    created_at__lt='2022-11-15 08:30',
).values_list('id', flat=True)[:200000]
for _ in range(0, 3):
    user_ids_str = ",".join(
        str(user_id) for user_id in user_ids.iterator(chunk_size=5000)
    )
    query = f"""
        DELETE FROM "book" WHERE "book"."user_id" IN ({user_ids_str});
        DELETE FROM "user" WHERE "user"."id" IN ({user_ids_str});
    """
    with transaction.atomic():
        with connection.cursor() as c:
            c.execute("SET statement_timeout = '10min';")
            c.execute(query)
This removes 600,000 records in around 10 minutes, and the server used around:
CPU: 50MB
RAM: 200MB
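A possible refinement, assuming psycopg2 as the driver: bind the id list as a query parameter instead of building the SQL string by hand, which avoids both the string concatenation and any quoting issues.

ids = list(user_ids)
with transaction.atomic():
    with connection.cursor() as c:
        # psycopg2 adapts a Python list to a Postgres array
        c.execute('DELETE FROM "book" WHERE "user_id" = ANY(%s)', [ids])
        c.execute('DELETE FROM "user" WHERE "id" = ANY(%s)', [ids])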

If you are using straight SQL, why not do a join on the user table with the date criteria to delete the books, and then delete all the users using the created_at criteria? Let the database do all the work!
Even without writing the join,

DELETE FROM "book" WHERE "user_id" IN (SELECT id FROM "user" WHERE created_at >= '2022-11-15 08:00' AND ...);
DELETE FROM "user" WHERE created_at >= '2022-11-15 08:00' AND ...;

would be better than what you have.
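The same set-based idea can also be expressed through the ORM (a sketch using the models from the question): Django's cascade collector deletes the related Book rows itself, so this is simpler, though not necessarily cheaper than the two raw statements above.

User.objects.filter(
    type='sample',
    created_at__gte='2022-11-15 08:00',
    created_at__lt='2022-11-15 08:30',
).delete()  # cascades to Book via on_delete=CASCADE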

Related

How could you make this really reaaally complicated raw SQL query with django's ORM?

Good day, everyone. Hope you're doing well. I'm a Django newbie, trying to learn the basics of RESTful development while helping with a small app project. Currently, there's a really difficult query that I must do to create a calculated field that updates my student's status according to the time interval the classes are in. First, let me explain the models:
class StudentReport(models.Model):
    student = models.ForeignKey(Student, on_delete=models.CASCADE,)
    headroom_teacher = models.ForeignKey(Teacher, on_delete=models.CASCADE,)
    upload = models.ForeignKey(Upload, on_delete=models.CASCADE, related_name='reports', blank=True, null=True,)
    exams_date = models.DateTimeField(null=True, blank=True)
    # Other fields that don't matter

class ExamCycle(models.Model):
    student = models.ForeignKey(Student, on_delete=models.CASCADE,)
    headroom_teacher = models.ForeignKey(Teacher, on_delete=models.CASCADE,)
    # Other fields that don't matter

class RecommendedClasses(models.Model):
    report = models.ForeignKey(StudentReport, on_delete=models.CASCADE,)
    range_start = models.DateField(null=True)
    range_end = models.DateField(null=True)
    # Other fields that don't matter

class StudentStatus(models.TextChoices):
    enrolled = 'enrolled'  # started class
    anxious_for_exams = 'anxious_for_exams'
    sticked_with_it = 'sticked_with_it'  # already passed one cycle
So this app will help manage a cram school. We first create an initial report of the student and their best/worst subjects in StudentReport. Then a RecommendedClasses object is created that tells them which classes they should enroll in. Finally, we have a cycle of exams (say, 4 times a year). After the student completes each exam, another report is created, and they can be recommended a new class or move on to the next level of their previous class.
I'll use the choices in StudentStatus to calculate an annotated field that I will call status on my RecommendedClasses model. I'm having issues with the sticked_with_it status because it's a query that is done after one cycle is completed and two reports have been made (two, because this query must be done in StudentStatus after the 2nd report is created). A 'sticked_with_it' student has a report created after the exams_date when the RecommendedClasses was created, and the future exams_date value falls within the 30 days before range_start and 60 days after range_end of the recommendation (don't question this, it's just the way the higher-ups want the status).
I have already come up with two ways to do it, but one is with a raw SQL query and the other is waaay too complicated and slow. Here it is:
SELECT rec.id AS rec_id
FROM school_recommendedclasses rec
LEFT JOIN school_report original_report
    ON rec.report_id = original_report.id
    AND rec.teacher_id = original_report.teacher_id
JOIN reports_report2 future_report
    ON future_report.exams_date > original_report.exams_date
    AND future_report.student_id = original_report.student_id
    AND future_report.`exams_date` > (rec.`range_start` - INTERVAL 30 DAY)
    AND future_report.`exams_date` < (rec.`range_end` + INTERVAL 60 DAY)
    AND original_report.student_id = future_report.student_id
How can I transfer this to a proper Django ORM query that is not so painfully unoptimized? I'll show you the other way in the comments.
FWIW, I find this easier to read, but there's very little wrong with your query.
Transforming this to your ORM should be straightforward, and any further optimisations are down to indexes...
SELECT r.id rec_id
FROM reports_recommendation r
JOIN reports_report2 o
    ON o.id = r.report_id
    AND o.provider_id = r.provider_id
JOIN reports_report2 f
    ON f.initial_exam_date > o.initial_exam_date
    AND f.patient_id = o.patient_id
    AND f.initial_exam_date > r.range_start - INTERVAL 30 DAY
    AND f.initial_exam_date < r.range_end + INTERVAL 60 DAY
    AND f.provider_id = o.provider_id
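A rough ORM translation, sketched with the model and field names from the question (not the answer's renamed tables); the date arithmetic on OuterRef is an assumption worth verifying on your Django version:

from datetime import timedelta
from django.db.models import Exists, OuterRef

# Future reports by the same student/teacher, after the original report...
future_reports = StudentReport.objects.filter(
    student_id=OuterRef('report__student_id'),
    headroom_teacher_id=OuterRef('report__headroom_teacher_id'),
    exams_date__gt=OuterRef('report__exams_date'),
).filter(
    # ...and inside the recommendation window.
    exams_date__gt=OuterRef('range_start') - timedelta(days=30),
    exams_date__lt=OuterRef('range_end') + timedelta(days=60),
)

sticked = RecommendedClasses.objects.annotate(
    has_followup=Exists(future_reports)
).filter(has_followup=True)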

Checking for overlapping TimeField ranges

I have this model:

class Task(models.Model):
    class Meta:
        unique_together = ("campaign_id", "task_start", "task_end", "task_day")

    campaign_id = models.ForeignKey(Campaign, on_delete=models.DO_NOTHING)
    playlist_id = models.ForeignKey(PlayList, on_delete=models.DO_NOTHING)
    task_id = models.AutoField(primary_key=True, auto_created=True)
    task_start = models.TimeField()
    task_end = models.TimeField()
    task_day = models.TextField()
I need to write a validation test that checks whether a newly created task's time range overlaps with an existing one in the database.
For example:
A task with ID 1 already starts at 5:00PM and ends at 5:15PM on a Saturday. A new task cannot be created between the first task's start and end time. Where should I write this test, and what is the most efficient way to do it? I also use Django REST Framework serializers.
When you receive the form data from the user, you can:
Check the fields are consistent: user task_start < user task_end, and warn the user if not.
Query (SELECT) the database to retrieve all existing tasks which intersect the user's time range,
Order the records by task_start (ORDER BY),
Select only records which satisfy your criterion, i.e.:
task_start <= user task_start <= task_end, or,
task_start <= user task_end <= task_end.
Warn the user if at least one record is found.
If everything is OK:
Construct a Task instance,
Store it in the database,
Return success.
Implementation details:
task_start and task_end could be indexed in your database to improve selection time.
I saw that you also have a task_day field (which is a TEXT).
You should really consider using UTC DATETIME fields instead of TEXT, because you need to compare date AND time (and not only time): consider a task which starts at 23:30 and finishes at 00:45 the day after…
This is how I solved it. It's not optimal by far, but I'm limited to Python 2.7 and Django 1.11, and I'm also a beginner.

def validate(self, data):
    errors = {}
    task_start = data.get('task_start')
    task_end = data.get('task_end')
    time_filter = (Q(task_start__range=[task_start, task_end])
                   | Q(task_end__range=[task_start, task_end]))
    filter_check = Task.objects.filter(time_filter).exists()
    if task_start > task_end:
        errors['error'] = u'End time cannot be earlier than start time!'
        raise serializers.ValidationError(errors)
    elif filter_check:
        errors['errors'] = u'Overlapping tasks'
        raise serializers.ValidationError(errors)
    return data
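Note that the two __range checks above miss one case: an existing task that lies entirely inside the new interval. The standard two-ranges-overlap test covers all cases; a sketch against the same Task model (restricting to the same task_day is an assumption):

overlap_exists = Task.objects.filter(
    task_day=data.get('task_day'),   # assumed: only same-day tasks can clash
    task_start__lt=task_end,         # existing task starts before the new one ends
    task_end__gt=task_start,         # and ends after the new one starts
).exists()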

Speed up Django query

I am working with Django to create a dashboard which presents many kinds of data. My problem is that the page loads slowly even though I hit the database (PostgreSQL) only once. The tables are loaded with data every 10 minutes, so they currently contain millions of records. When I make a query with the Django ORM, I get the data slowly (according to the Django toolbar it takes 1.4 seconds). I know that this is not much, but it is half of the total loading time (3.1 seconds), so if I could decrease the query time, the page load time would drop too and the user experience would be better. When the query runs I fetch ~2,800 rows. Is there any way to speed up this query? I do not know whether I am doing something wrong or this time is normal for this amount of data. I attach my query and model. Thank you in advance for your help.
My query (here I fetch a 6-hour time interval):

my_query = MyTable.objects.filter(time_stamp__range=(before_now, now)).values('time_stamp', 'value1', 'value2')

Here I tried to use .iterator(), but the query wasn't faster.
My model:

class MyTable(models.Model):
    time_stamp = models.DateTimeField()
    value1 = models.FloatField(blank=True, null=True)
    value2 = models.FloatField(blank=True, null=True)
Add an index:

class MyTable(models.Model):
    time_stamp = models.DateTimeField()
    value1 = models.FloatField(blank=True, null=True)
    value2 = models.FloatField(blank=True, null=True)

    class Meta:
        indexes = [
            models.Index(fields=['time_stamp']),
        ]
Don't forget to run manage.py makemigrations and manage.py migrate after this.
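To confirm the index is actually being used, QuerySet.explain() (available since Django 2.1) prints the database's query plan; for example:

qs = MyTable.objects.filter(time_stamp__range=(before_now, now)).values('time_stamp', 'value1', 'value2')
print(qs.explain())  # after migrating, this should show an index scan on time_stamp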

Creating a query with foreign keys and grouping by some data in Django

I have thought about my problem for days and I need a fresh view on it.
I am building a small application for a client for his deliveries.
# models.py - Clients app
class ClientPR(models.Model):
    title = models.CharField(max_length=5, choices=TITLE_LIST, default='mr')
    last_name = models.CharField(max_length=65)
    first_name = models.CharField(max_length=65, verbose_name='Prénom')
    frequency = WeekdayField(default=[])  # Returns a CommaSeparatedIntegerField from 0 for Monday to 6 for Sunday...
    [...]

# models.py - Delivery app
class Truck(models.Model):
    name = models.CharField(max_length=40, verbose_name='Nom')
    description = models.CharField(max_length=250, blank=True)
    color = models.CharField(max_length=10, choices=COLORS, default='green', unique=True, verbose_name='Couleur Associée')

class Order(models.Model):
    delivery = models.ForeignKey(OrderDelivery, verbose_name='Delivery')
    client = models.ForeignKey(ClientPR)
    order = models.PositiveSmallIntegerField()

class OrderDelivery(models.Model):
    date = models.DateField(default=d.today)  # pass the callable, not d.today(), or the default is frozen at import time
    truck = models.ForeignKey(Truck, verbose_name='Camion', unique_for_date="date")
So I was trying to get a query and I came up with this one:

ClientPR.objects.today().filter(order__delivery__date=date.today()).order_by('order__delivery__truck', 'order__order')

But it does not do what I really want.
I want a list of ClientPR objects (querysets) grouped by truck and ordered by today's delivery order!
The thing is, I want EVERY client for the day, even those not in the delivery list, and with filter that cannot be it.
I can make a query with the OrderDelivery model, but I will only get the clients for the delivery, not all of them for the day...
Maybe I need to do it with a Q object? Or even raw SQL?
Maybe I have built my model relationships the wrong way? Or I need to lower what I want to do... Well, for now, I need your help to see the problem with new eyes!
Thanks to those who will take some time to help me.
After some tests, I decided to go with two queries for one table.
One from the OrderDelivery queryset to get the list of clients grouped by truck, and another from the ClientPR queryset for all the clients without a delivery set for them.
That way, no problem!
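A sketch of that two-query approach, using the models above (the .today() manager method and the default reverse name order_set are assumptions):

from datetime import date

# Clients with a delivery today, grouped by truck via the delivery rows.
deliveries = (OrderDelivery.objects
              .filter(date=date.today())
              .select_related('truck')
              .prefetch_related('order_set__client'))

# Clients due today (custom manager) but with no delivery scheduled.
scheduled_ids = Order.objects.filter(delivery__date=date.today()).values_list('client_id', flat=True)
leftover_clients = ClientPR.objects.today().exclude(pk__in=scheduled_ids)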

Django aggregation over filtered query

Assuming I have these two models:

class User(models.Model):
    username = models.CharField(max_length=255)

class Item(models.Model):
    user = models.ForeignKey(User)
    enabled = models.BooleanField()
    price = models.IntegerField()
I would like to create an optimal query to get the top 10 users which have at least 10 enabled items, ranked by the highest average price of those items (sorted by best average).
In other words, I am trying to build a top-10 "leaderboard" on my site for the users that own the best-priced items on average; however, some of the items may be disabled yet still exist in my database, and I am trying to exclude them in my ORM query but can't find a good way of doing it.
This operation runs every 5 minutes or so; it does not run while generating a page.
I can't test right now, but I think this should work:

topusers = User.objects.prefetch_related(
    'item_set'
).filter(
    item__enabled=True              # the reverse query name is 'item', not 'item_set'
).annotate(
    item_count=models.Count('item'),
    avg_price=models.Avg('item__price'),
).filter(
    item_count__gte=10
).order_by('-avg_price')[:10]       # top 10, as asked
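The ordering matters here: because the filter on item__enabled=True comes before annotate(), Count and Avg only see enabled rows, which is exactly the "aggregation over filtered query" the title asks about. On Django 2.0+ an equivalent, often clearer spelling is a filtered aggregate (the filter argument to Count and Avg is standard Django API):

from django.db.models import Avg, Count, Q

topusers = User.objects.annotate(
    item_count=Count('item', filter=Q(item__enabled=True)),
    avg_price=Avg('item__price', filter=Q(item__enabled=True)),
).filter(item_count__gte=10).order_by('-avg_price')[:10]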