When you pass a list of pk integers into add or remove, are the objects accessed? I.e., is there a database call for each pk?
When you create a ManyToManyField without specifying an intermediary table (using through), Django generates a table for you. This table only needs the pks of both models, so there's no need to select anything from the other objects in order to save the new relationships.
A new row will be created for each of them, possibly using many inserts, but not necessarily many database calls (a single SQL query with multiple insert commands, for instance, is possible). All the info needed for those creations (the pks of your objects) is readily available, so there's no need for any more database hits than necessary.
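For instance, a minimal sketch of the call in question (Article and Tag are hypothetical models related through a plain ManyToManyField named tags):
# Hypothetical models: Article has tags = models.ManyToManyField(Tag).
article = Article.objects.get(pk=1)

# Passing raw pk integers: Django only needs the ids to fill the
# auto-generated join table, so the Tag rows themselves are never fetched.
article.tags.add(7, 8, 9)
article.tags.remove(7)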
Update: it seems I was mistaken. Looking at the source (django/db/models/fields/related.py), I saw that it performs an independent create for each object:
for obj_id in new_ids:
    self.through._default_manager.using(db).create(**{
        '%s_id' % source_field_name: self._pk_val,
        '%s_id' % target_field_name: obj_id,
    })
Before doing that, it also checks whether any of the supplied pks already exist in the database (in order to avoid duplicate entries/uniqueness constraint violations):
vals = self.through._default_manager.using(db).values_list(target_field_name, flat=True)
vals = vals.filter(**{
    source_field_name: self._pk_val,
    '%s__in' % target_field_name: new_ids,
})
new_ids = new_ids - set(vals)
This check is done with a single query though...
Did you try to check that yourself using the QuerySet.query attribute?
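For example, a rough sketch of two ways to check (str(qs.query) shows the SQL of a queryset, and django.db.connection.queries records executed statements when DEBUG = True; the article.tags call refers to the hypothetical models above):
from django.db import connection, reset_queries

print(Article.objects.filter(pk__in=[1, 2, 3]).query)  # SQL of a queryset

reset_queries()
article.tags.add(7, 8, 9)         # hypothetical m2m call from above
for q in connection.queries:      # populated only when DEBUG = True
    print(q["sql"])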
I'm trying to use select_related/prefetch_related to optimize some queries. However, I'm having trouble "forcing" the queries to be evaluated all at once.
Say I'm doing the following:
fp_query = Fourprod.objects.filter(choisi=True).select_related("fk_fournis")
pf = Prefetch("fourprod", queryset=fp_query)
products = Products.objects.filter(id__in=fp_query).prefetch_related(pf)
With models:
class Fourprod(models.Model):
    fk_produit = models.ForeignKey(to=Produit, related_name="fourprod")
    fk_fournis = models.ForeignKey(to=Fournis, related_name="fourprod")
    choisi = models.BooleanField(...)

class Produit(models.Model):
    ...  # ordinary fields

class Fournis(models.Model):
    ...  # ordinary fields
So essentially, Fourprod has a fk to both Fournis and Produit, and I want to prefetch those when I build the Produit queryset. I've checked in debug that the prefetch actually occurs, and it does.
I have a bunch of fields from different models I need to use to compute results. I don't really control the table structure, so I have to work with this. I can't come up with a reasonable query to do it all at the database level (or using raw), so I want to compute things Python-side. It's a few thousand objects, so it's reasonable to do in memory. So I cast to a list to force the query evaluation:
products = list(products)
At this point, I would think that the Products and the related objects that I have pre-fetched should have been fetched from the DB. In the logs, just after the list() call, I get this:
02/08/22 15:21:08 DEBUG DEFAULT: (0.019) SELECT "products_fourprod"."id", "products_fourprod"."fk_produit_id", "products_fourprod"."fk_fournis_id", "products_fourprod"."choisi", "products_fourprod"."code_four", "products_fourprod"."prix", "products_fourprod"."comment", "products_fournis"."id", "products_fournis"."fk_user_create_id", "products_fournis"."nom", "products_fournis"."adresse", "products_fournis"."ville", "products_fournis"."tel", "products_fournis"."fax", "products_fournis"."contact", "products_fournis"."note", "products_fournis"."pays", "products_fournis"."province", "products_fournis"."postal", "products_fournis"."monnaie", "products_fournis"."tel_long", "products_fournis"."inactif", "products_fournis"."inuse", "products_fournis"."par", "products_fournis"."fk_langue", "products_fournis"."NOTE2" FROM "products_fourprod" LEFT OUTER JOIN "products_fournis" ON ("products_fourprod"."fk_fournis_id" = "products_fournis"."id") WHERE ("products_fourprod"."choisi" AND "products_fourprod"."fk_produit_id" IN (... all Product.id meeting the conditions...)
But then, the list comprehension using the products takes forever to complete:
rows = [[p.id, p.fourprod.first().id, p.desuet, p.no_prod, ... ] for p in products]
with apparently every single call to p.fourprod resulting in a DB hit:
02/08/22 15:26:19 DEBUG DEFAULT: (0.000) SELECT "products_fourprod"."id", "products_fourprod"."fk_produit_id", "products_fourprod"."fk_fournis_id", "products_fourprod"."choisi", "products_fourprod"."code_four", "products_fourprod"."prix", "products_fourprod"."comment", "products_fournis"."id", "products_fournis"."fk_user_create_id", "products_fournis"."nom", "products_fournis"."adresse", "products_fournis"."ville", "products_fournis"."tel", "products_fournis"."fax", "products_fournis"."contact", "products_fournis"."note", "products_fournis"."pays", "products_fournis"."province", "products_fournis"."postal", "products_fournis"."monnaie", "products_fournis"."tel_long", "products_fournis"."inactif", "products_fournis"."inuse", "products_fournis"."par", "products_fournis"."fk_langue", "products_fournis"."NOTE2" FROM "products_fourprod" LEFT OUTER JOIN "products_fournis" ON ("products_fourprod"."fk_fournis_id" = "products_fournis"."id") WHERE ("products_fourprod"."choisi" AND "products_fourprod"."fk_produit_id" = 1185) ORDER BY "products_fourprod"."id" ASC LIMIT 1; args=(1185,)
02/08/22 15:26:19 DEBUG DEFAULT: (0.000) SELECT "products_fourprod"."id", (.... more similar db hits... )
If I remove all the uses of related objects, then the list() call has actually forced the db hit already and the query executes quickly.
So.... if simply calling products = list(products) does not force the db to be queried for the prefetched objects as well, is there any way I can make Django's ORM do so?
From the docs:
Remember that, as always with QuerySets, any subsequent chained methods which imply a different database query will ignore previously cached results, and retrieve data using a fresh database query.
first() implies a database query, so that will cause your query to not use the prefetched values.
Try using p.fourprod.all()[0] instead to access the first related fourprod.
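Applied to the comprehension from the question (field names as in the question), a sketch:
# .all() reuses the prefetched cache; .first() issues a fresh query per product.
rows = [
    [p.id, p.fourprod.all()[0].id, p.desuet, p.no_prod]
    for p in products
]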
I am trying to run a prefetch as below and it gives me this error: Prefetch querysets cannot use raw(), values(), and values_list().
queryset = SymbolList.objects.prefetch_related(
    Prefetch(
        'stock_dailypricehistory_symbol',
        queryset=DailyPriceHistory.objects.filter(id__in=[1, 2, 3, 4]).values('close', 'volume'),
    )
).all()
The DailyPriceHistory model has a lot of columns; I only want close and volume.
Here id__in=[1,2,3,4] is just shown as an example; it can go up to 10,000+ ids at a time.
You indeed cannot make use of .values(…) [Django-doc] or .values_list(…) [Django-doc] when prefetching. That would also result in extra complications, since Django would then have more trouble figuring out how to perform the "joining" at the Django/Python level, if that is even possible.
You however do not need to use .values(…) to limit the number of columns you fetch; you can use .only(…) [Django-doc] instead. .only(…) fetches only the given columns and the primary key into memory immediately, and "postpones" fetching the other columns. Only if you later need those columns will it make extra queries. This is thus a bit similar to how Django only fetches the related object of a ForeignKey, with an extra query, when you actually need it.
queryset = SymbolList.objects.prefetch_related(
    Prefetch(
        'stock_dailypricehistory_symbol',
        queryset=DailyPriceHistory.objects.filter(
            id__in=[1, 2, 3, 4]
        ).only('close', 'volume', 'symbol_id'),
    )
)
Here symbol_id is the column of the ForeignKey that points from DailyPriceHistory to the SymbolList model; in your project it might have a different name. If you do not include it, Django has to make extra queries to figure out how to link each DailyPriceHistory to its corresponding SymbolList, which only slows down processing, so you should include it as well.
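A usage sketch under those assumptions (the related_name and field names are taken from the question); iterating over the prefetched rows should not trigger extra queries:
for symbol in queryset:
    # .all() is served from the prefetched cache, so there is no query per symbol
    for price in symbol.stock_dailypricehistory_symbol.all():
        print(price.close, price.volume)  # other columns would trigger deferred loads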
I need to be able to bulk insert large numbers of records quickly, while still ensuring uniqueness in the database. The new records to be inserted have already been parsed and are unique. I'm hoping there is a way to enforce uniqueness at the database level, and not in the code itself.
I'm using MySQL as the database backend. If Django supports this functionality in any other database, I am flexible about changing the backend, as this is a requirement.
Bulk inserts in Django don't use the save method, so how can I insert several hundred to several thousand records at a time, while still respecting unique fields and unique together fields?
My model structures, simplified, look something like this:
class Example(models.Model):
    class Meta:
        unique_together = (('name', 'number'),)

    name = models.CharField(max_length=50)
    number = models.CharField(max_length=10)
    ...
    fk = models.ForeignKey(OtherModel)
Edit:
The records that aren't already in the database should be inserted, and the records that already exist should be ignored.
As miki725 mentioned, you don't have a problem with your current code.
I'm assuming you are using the bulk_create method. It is true that the save() method is not called when using bulk_create, but the uniqueness of fields is not enforced inside the save() method. When you use unique_together, a unique constraint is added to the underlying table in MySQL when creating the table:
Django:
unique_together = (('name', 'number'),)
MySQL:
UNIQUE KEY `name` (`name`,`number`)
So if you insert a value into the table using any method (save(), bulk_create() or even raw SQL) you will get this exception from MySQL:
Duplicate entry 'value1-value2' for key 'name'
UPDATE:
What bulk_create does is build one big query that inserts all the data at once. So if one of the entries is a duplicate, it throws an exception and none of the data is inserted.
1- One option is to use the batch_size parameter of bulk_create and insert the data in a number of batches, so that if one of them fails you only lose the rest of the data of that batch (it depends on how important it is to insert all the data and how frequent the duplicate entries are).
2- Another option is to write a for loop over the bulk data and insert the items one by one. This way the exception is thrown for that one row only and the rest of the data is inserted. This will query the database for every row and is of course a lot slower (see the sketch of options 1 and 2 after this list).
3- A third option is to lift the unique constraint, insert the data using bulk_create, and then write a simple query that deletes the duplicate rows.
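A minimal sketch of options 1 and 2, assuming new_objects is a hypothetical list of unsaved Example instances; note that more recent Django versions (2.2+) also offer bulk_create(..., ignore_conflicts=True), which skips conflicting rows in a single query:
from django.db import IntegrityError

BATCH = 500

# Option 1: insert in batches; a duplicate only costs you that one batch.
for start in range(0, len(new_objects), BATCH):
    batch = new_objects[start:start + BATCH]
    try:
        Example.objects.bulk_create(batch)
    except IntegrityError:
        pass  # this batch contained a (name, number) already in the table

# Option 2: insert row by row, skipping duplicates (slow but loses nothing).
for obj in new_objects:
    try:
        obj.save()
    except IntegrityError:
        pass

# Django 2.2+: let the database ignore duplicates in one bulk query.
# Example.objects.bulk_create(new_objects, ignore_conflicts=True)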
Django itself does not enforce the unique_together meta attribute. It is enforced by the database using a UNIQUE constraint. You can insert as much data as you want and you are guaranteed that the specified fields will be unique; if they are not, an IntegrityError will be raised. More about unique_together in the docs.
I have a model that has four fields. How do I remove duplicate objects from my database?
Daniel Roseman's answer to this question seems appropriate, but I'm not sure how to extend this to situation where there are four fields to compare per object.
from django.db import models


def remove_duplicated_records(model, fields):
    """
    Removes records from `model` duplicated on `fields`
    while leaving the most recent one (biggest `id`).
    """
    duplicates = model.objects.values(*fields)

    # override any model specific ordering (for `.annotate()`)
    duplicates = duplicates.order_by()

    # group by same values of `fields`; count how many rows are the same
    duplicates = duplicates.annotate(
        max_id=models.Max("id"), count_id=models.Count("id")
    )

    # leave out only the ones which are actually duplicated
    duplicates = duplicates.filter(count_id__gt=1)

    for duplicate in duplicates:
        to_delete = model.objects.filter(**{x: duplicate[x] for x in fields})

        # leave out the latest duplicated record
        # you can use `Min` if you wish to leave out the first record
        to_delete = to_delete.exclude(id=duplicate["max_id"])

        to_delete.delete()
You shouldn't do this often. Use a unique_together constraint on the database instead.
This leaves the record with the biggest id in the DB. If you want to keep the original record (the first one), modify the code a bit with models.Min. You can also use a completely different field, like a creation date or something.
Underlying SQL
When annotating, the Django ORM uses a GROUP BY statement on all model fields used in the query, hence the use of the .values() method. GROUP BY groups all records having identical values for those fields. The duplicated ones (more than one id for the same values) are then filtered out in the HAVING statement generated by .filter() on the annotated QuerySet.
SELECT
    field_1,
    …
    field_n,
    MAX(id) AS max_id,
    COUNT(id) AS count_id
FROM
    app_mymodel
GROUP BY
    field_1,
    …
    field_n
HAVING
    count_id > 1
The duplicated records are then deleted in the for loop, keeping only the most recent one (the one with the biggest id) for each group.
Empty .order_by()
Just to be sure, it's always wise to add an empty .order_by() call before aggregating a QuerySet.
The fields used for ordering the QuerySet are also included in the GROUP BY statement. An empty .order_by() overrides the columns declared in the model's Meta, so they are not included in the SQL query (e.g. default sorting by date could ruin the results).
You might not need to override it at the moment, but someone might add default ordering later and thereby ruin your precious delete-duplicates code without even knowing it. Yes, I'm sure you have 100% test coverage…
Just add empty .order_by() to be safe. ;-)
https://docs.djangoproject.com/en/3.2/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
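As an illustration of the pitfall (MyModel and the field names are placeholders; models is django.db.models as above): if the model declared a default Meta.ordering such as ("-created",), its column would sneak into the GROUP BY unless you clear it:
duplicates = (
    MyModel.objects.values("field_1", "field_n")
    .order_by()  # drop any default Meta.ordering before aggregating
    .annotate(count_id=models.Count("id"))
    .filter(count_id__gt=1)
)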
Transaction
Of course you should consider doing it all in a single transaction.
https://docs.djangoproject.com/en/3.2/topics/db/transactions/#django.db.transaction.atomic
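For instance, a sketch of wrapping the cleanup from above in a single transaction (MyModel and the field list are placeholders):
from django.db import transaction

with transaction.atomic():
    remove_duplicated_records(MyModel, ["field_1", "field_n"])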
If you want to delete duplicates on single or multiple columns, you don't need to iterate over millions of records.
Fetch all unique columns (don't forget to include the primary key column)
fetch = Model.objects.all().values("id", "skuid", "review", "date_time")
Read the result using pandas (I did it with pandas instead of an ORM query):
import pandas as pd
df = pd.DataFrame.from_dict(fetch)
Drop duplicates on unique columns
uniq_df = df.drop_duplicates(subset=["skuid", "review", "date_time"])
# don't add the primary key to the subset
Now you'll get the unique records, from which you can pick the primary keys:
primary_keys = uniq_df["id"].tolist()
Finally, it's showtime (exclude those ids from the records and delete the rest of the data):
records = Model.objects.all().exclude(pk__in=primary_keys).delete()
Is it possible to check how many rows were deleted by a query?
queryset = MyModel.objects.filter(foo=bar)
queryset.delete()
deleted = ...
Or should I use transactions for that?
@transaction.commit_on_success
def delete_some_rows():
    queryset = MyModel.objects.filter(foo=bar)
    deleted = queryset.count()
    queryset.delete()
PHP + MySQL example:
mysql_query('DELETE FROM mytable WHERE id < 10');
printf("Records deleted: %d\n", mysql_affected_rows());
There are many situations where you want to know how many rows were deleted, for example if you do something based on that number. Checking it with a separate COUNT creates extra database load and is not atomic.
The queryset.delete() method immediately deletes the objects and returns the number of objects deleted and a dictionary with the number of deletions per object type.
Check the docs for more details: https://docs.djangoproject.com/en/stable/topics/db/queries/#deleting-objects
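For the query from the question, a sketch of capturing the return value (MyModel, foo and bar as in the question):
deleted_count, deleted_per_type = MyModel.objects.filter(foo=bar).delete()
# deleted_count    -> total number of rows removed, e.g. 3
# deleted_per_type -> e.g. {'myapp.MyModel': 3}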
You can view the actual number of rows affected with SELECT ROW_COUNT() (in MySQL).
First of all, qs.count() and cursor.rowcount are not the same thing!
In MySQL with InnoDB under REPEATABLE READ (the default mode), READ queries and WRITE queries see different snapshots of the data!
READ queries read from an old snapshot, while WRITE queries see the latest committed data, as if they were running in READ COMMITTED mode.
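If you really need the driver-level count, a sketch using a raw cursor (the table name myapp_mymodel is a placeholder); cursor.rowcount reports how many rows the DELETE actually affected:
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("DELETE FROM myapp_mymodel WHERE foo = %s", [bar])
    print(cursor.rowcount)  # rows affected, as reported by the database driver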