Is it possible to check how many rows were deleted by a query?
queryset = MyModel.objects.filter(foo=bar)
queryset.delete()
deleted = ...
Or should I use transactions for that?
@transaction.commit_on_success
def delete_some_rows():
    queryset = MyModel.objects.filter(foo=bar)
    deleted = queryset.count()
    queryset.delete()
PHP + MySQL example:
mysql_query('DELETE FROM mytable WHERE id < 10');
printf("Records deleted: %d\n", mysql_affected_rows());
There are many situations where you want to know how many rows were deleted, for example if you do something based on how many rows were deleted. Checking it by performing a COUNT creates extra database load and is not atomic.
The queryset.delete() method immediately deletes the objects and returns a tuple: the total number of objects deleted and a dictionary with the number of deletions per object type.
Check the docs for more details: https://docs.djangoproject.com/en/stable/topics/db/queries/#deleting-objects
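A minimal sketch of how that return value can be unpacked. The tuple used here is a hand-written stand-in with the same shape that delete() returns (the labels and counts are made up); in a real project it would come from MyModel.objects.filter(foo=bar).delete().

```python
def summarize_delete(result):
    """Split the (total, per_model) pair returned by QuerySet.delete()."""
    total, per_model = result
    lines = ["Records deleted: %d" % total]
    for label, count in sorted(per_model.items()):
        lines.append("  %s: %d" % (label, count))
    return "\n".join(lines)


# Hypothetical value; in Django: result = MyModel.objects.filter(foo=bar).delete()
sample_result = (3, {"myapp.MyModel": 2, "myapp.RelatedModel": 1})
print(summarize_delete(sample_result))
```

The total includes rows removed through cascading deletes, which is why the per-model dictionary can list more than one model.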
You can see the actual number of rows affected with SELECT ROW_COUNT().
First of all, qs.count() and cursor.rowcount are not the same thing!
In MySQL with InnoDB under REPEATABLE READ (the default isolation level), read queries and write queries see different sets of rows.
Read queries read from the old snapshot, while write queries see the latest committed data, as if they were running in READ COMMITTED mode.
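To illustrate the difference, here is a small sketch that reads the affected-row count from the DELETE statement itself through the DB-API cursor.rowcount attribute, instead of running a separate COUNT beforehand. It uses the standard-library sqlite3 module as a stand-in for MySQL; the table name is borrowed from the PHP example and the data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO mytable (id) VALUES (?)", [(i,) for i in range(1, 21)])

# The count comes from the statement that did the deleting,
# so there is no window between counting and deleting.
cursor = conn.execute("DELETE FROM mytable WHERE id < 10")
deleted = cursor.rowcount
conn.commit()

print("Records deleted: %d" % deleted)
```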
I'm trying to use select_related/prefetch_related to optimize some queries. However I have issues in "forcing" the queries to be evaluated all at once.
Say I'm doing the following:
fp_query = Fourprod.objects.filter(choisi=True).select_related("fk_fournis")
pf = Prefetch("fourprod", queryset=fp_query)
products = Products.objects.filter(id__in=fp_query).prefetch_related(pf)
With models:
class Fourprod(models.Model):
    fk_produit = models.ForeignKey(to=Produit, related_name="fourprod")
    fk_fournis = models.ForeignKey(to=Fournis, related_name="fourprod")
    choisi = models.BooleanField(...)

class Produit(models.Model):
    ... ordinary fields...

class Fournis(models.Model):
    ... ordinary fields...
So essentially, Fourprod has a fk to Fournis, Produit, and I want to prefetch those when I build the Produits queryset. I've checked in debug that the prefetch actually occurs and it does.
I have a bunch of fields from different models I need to use to compute results. I don't really control the table structure, so I have to work with this. I can't come up with a reasonable query to do it all with the queries (or using raw), so I want to compute stuff python-side. It's a few 1000 objects, so reasonable to do in-memory. So I cast to a list to force the query evaluation:
products = list(products)
At this point, I would think that the Products and the related objects that I have pre-fetched should have been fetched from the DB. In the logs, just after the list() call, I get this:
02/08/22 15:21:08 DEBUG DEFAULT: (0.019) SELECT "products_fourprod"."id", "products_fourprod"."fk_produit_id", "products_fourprod"."fk_fournis_id", "products_fourprod"."choisi", "products_fourprod"."code_four", "products_fourprod"."prix", "products_fourprod"."comment", "products_fournis"."id", "products_fournis"."fk_user_create_id", "products_fournis"."nom", "products_fournis"."adresse", "products_fournis"."ville", "products_fournis"."tel", "products_fournis"."fax", "products_fournis"."contact", "products_fournis"."note", "products_fournis"."pays", "products_fournis"."province", "products_fournis"."postal", "products_fournis"."monnaie", "products_fournis"."tel_long", "products_fournis"."inactif", "products_fournis"."inuse", "products_fournis"."par", "products_fournis"."fk_langue", "products_fournis"."NOTE2" FROM "products_fourprod" LEFT OUTER JOIN "products_fournis" ON ("products_fourprod"."fk_fournis_id" = "products_fournis"."id") WHERE ("products_fourprod"."choisi" AND "products_fourprod"."fk_produit_id" IN (... all Product.id meeting the conditions...)
But then, the list comprehension using the products takes forever to complete:
rows = [[p.id, p.fourprod.first().id, p.desuet, p.no_prod, ... ] for p in products]
With apparently each single call to p.fourprod resulting in a DB hit:
02/08/22 15:26:19 DEBUG DEFAULT: (0.000) SELECT "products_fourprod"."id", "products_fourprod"."fk_produit_id", "products_fourprod"."fk_fournis_id", "products_fourprod"."choisi", "products_fourprod"."code_four", "products_fourprod"."prix", "products_fourprod"."comment", "products_fournis"."id", "products_fournis"."fk_user_create_id", "products_fournis"."nom", "products_fournis"."adresse", "products_fournis"."ville", "products_fournis"."tel", "products_fournis"."fax", "products_fournis"."contact", "products_fournis"."note", "products_fournis"."pays", "products_fournis"."province", "products_fournis"."postal", "products_fournis"."monnaie", "products_fournis"."tel_long", "products_fournis"."inactif", "products_fournis"."inuse", "products_fournis"."par", "products_fournis"."fk_langue", "products_fournis"."NOTE2" FROM "products_fourprod" LEFT OUTER JOIN "products_fournis" ON ("products_fourprod"."fk_fournis_id" = "products_fournis"."id") WHERE ("products_fourprod"."choisi" AND "products_fourprod"."fk_produit_id" = 1185) ORDER BY "products_fourprod"."id" ASC LIMIT 1; args=(1185,)
02/08/22 15:26:19 DEBUG DEFAULT: (0.000) SELECT "products_fourprod"."id", (.... more similar db hits... )
If I remove all the uses of related objects, then the list() call has actually forced the db hit already and the query executes quickly.
So, if simply calling products = list(products) does not force the db to be queried for the prefetched objects as well, is there any way I can make Django's ORM do so?
From the docs:
Remember that, as always with QuerySets, any subsequent chained methods which imply a different database query will ignore previously cached results, and retrieve data using a fresh database query.
first() implies a new database query (it adds ordering and a LIMIT), so it will not use the prefetched values.
Try p.fourprod.all()[0] instead to access the first related fourprod.
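Note that all()[0] raises IndexError when a product has no related rows. A small helper can cover that case; this is only a sketch, and first_prefetched and FakeManager are names invented here (FakeManager merely imitates a related manager so the snippet runs without a database).

```python
def first_prefetched(manager):
    """Return the first related object without chaining a new query.

    Iterating manager.all() on a prefetched relation is served from the
    prefetch cache, whereas manager.first() issues a fresh query.
    Returns None when there are no related objects.
    """
    return next(iter(manager.all()), None)


class FakeManager:
    """Stand-in for a Django related manager, for demonstration only."""

    def __init__(self, items):
        self._items = list(items)

    def all(self):
        return self._items


print(first_prefetched(FakeManager(["fourprod-1", "fourprod-2"])))
print(first_prefetched(FakeManager([])))
```

In the list comprehension this would be used as first_prefetched(p.fourprod), with a check for None before reading .id.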
I'm using Django 1.11 with MySQL. Upgrading to 2 isn't feasible in the short term so isn't an acceptable solution to my immediate problem, but answers referring to Django 2 may help others so feel free to post them.
I need to perform a data migration on all rows in a table. There are fewer than 40000 rows but they are quite big: two of the columns are ~15KB of JSON which get parsed when the model is loaded. (These are the columns I need to use in the data migration, so I cannot defer them.)
So as not to load all the objects into memory simultaneously, I thought I'd use queryset.iterator, which only parses rows 100 at a time. This works fine if all I do is read the results, but if I perform another query (e.g. to save one of the objects), then once I reach the end of the current chunk of 100 results, the next chunk of 100 results is not fetched and the iterator finishes.
It's as if the result set that fetchmany fetches the rows from has been lost.
To illustrate the scenario using ./manage.py shell
(Assume there exist 40000 MyModel with sequential ids)
iterator = app.models.MyModel.objects.iterator()
for obj in iterator:
    print(obj.id)
The above prints the ids 1 to 40000 as expected.
iterator = app.models.MyModel.objects.iterator()
for obj in iterator:
    print(obj.id)
    obj.save()
The above only prints the ids 1 to 100
iterator = app.models.MyModel.objects.iterator()
for obj in iterator:
    print(obj.id)
    if obj.id == 101:
        obj.save()
The above only prints the ids 1 to 200
Replacing obj.save with anything else that makes a query to the DB (eg app.models.OtherModel.objects.first()) has the same result.
Is it simply not possible to make another query while using queryset iterator? Is there another way to achieve the same thing?
Thanks
As suggested by @dirkgroten, Paginator is an alternative to iterator that's potentially a better solution in terms of memory usage, as it uses slicing on the queryset, which adds OFFSET and LIMIT clauses to retrieve only part of the full result set.
However, high OFFSET values incur a performance penalty on MySQL: https://www.eversql.com/faster-pagination-in-mysql-why-order-by-with-limit-and-offset-is-slow/
Therefore seeking on an indexed column may be a better option:
chunk_size = 100
seek_id = 0
next_seek_id = -1
while seek_id != next_seek_id:
    seek_id = next_seek_id
    # order_by('id') keeps consecutive slices contiguous; without it the seek is unreliable
    for obj in app.models.MyModel.objects.filter(id__gt=seek_id).order_by('id')[:chunk_size]:
        next_seek_id = obj.id
        # do your thing
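The same seek pattern, sketched outside Django with the standard-library sqlite3 module so it can be run as is. The table name mymodel, the payload column, and the helper iter_by_id are assumptions made for this illustration. Because every chunk is fetched completely before it is processed, running other queries inside the loop does not disturb the iteration.

```python
import sqlite3


def iter_by_id(conn, chunk_size=100):
    """Yield (id, payload) rows in id order, one bounded query per chunk."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM mymodel WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        for row in rows:
            yield row
        last_id = rows[-1][0]  # seek past the last id of this chunk


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mymodel (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO mymodel (payload) VALUES (?)", [("old",)] * 250)

seen = 0
for row_id, payload in iter_by_id(conn, chunk_size=100):
    # Writing inside the loop is fine: the current chunk is already in memory.
    conn.execute("UPDATE mymodel SET payload = ? WHERE id = ?", ("new", row_id))
    seen += 1
conn.commit()
print(seen)
```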
Additionally, if your data is such that performing the query isn't expensive but instantiating model instances is, iterator has the potential advantage of doing a single database query. Hopefully other answers will be able to shed light on the use of queryset.iterator with other queries.
I want to retrieve records from a table, initialize some fields, and then update the data in memory (not saving to the database) after some filter. But this second filter makes a new query to the database, so my initialized fields are lost.
How can I solve this issue?
res=where(code: 0) # (1) retrieve objects from db, as a template
res.each{|r| r.amount=10} # (2) change amount field with 10 in res
Here I expect 10+2, but the result is 0+2, because the next line executes a new query on the database, so line (2) has no effect:
res.where("caconana='1221'").first.amount+= 2
Possible answer: Is there a better solution than converting 'res' to an array?
res=where(code: 0).to_a
res.each{|r| r.amount=10}
i=res.find_index{|r| r.caconana=='1221'}
res[i].amount+=2
Thanks to PSKocit I have the solution: the find method does not execute a query against the database.
That each method call on res will cause all the matching ActiveRecord objects to be fully constructed anyway. So if you do want them constructed, you might as well use the caconana attributes that got loaded:
res.each {|r| r.amount = 10 }
res.find {|r| r.caconana == 1221 }.amount+=2
(Then you'd need to save each record to have that committed to the database.)
You can avoid some ActiveRecord object construction with:
where(code: 0).update_all(amount: 10) #One query to the database, no AR objects constructed in memory
Then for the incrementing (this is with AR object construction, but it's just for one record)
obj = where(code: 0).where(caconana: 1221).limit(1).first
obj.amount+=2
#obj.save!
I am using bulk_create to load thousands of rows into a PostgreSQL DB. Unfortunately some of the rows are causing IntegrityError and stopping the bulk_create process. I was wondering if there was a way to tell Django to ignore such rows and save as much of the batch as possible?
This is now possible in Django 2.2.
Django 2.2 adds a new ignore_conflicts option to the bulk_create method, from the documentation:
On databases that support it (all except PostgreSQL < 9.5 and Oracle), setting the ignore_conflicts parameter to True tells the database to ignore failure to insert any rows that fail constraints such as duplicate unique values. Enabling this parameter disables setting the primary key on each model instance (if the database normally supports it).
Example:
Entry.objects.bulk_create([
    Entry(headline='This is a test'),
    Entry(headline='This is only a test'),
], ignore_conflicts=True)
One quick-and-dirty workaround for this that doesn't involve manual SQL and temporary tables is to just attempt to bulk insert the data. If it fails, revert to serial insertion.
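from django.db import IntegrityError

objs = [Event(...), Event(...), Event(...)]  # the instances to insert
try:
    Event.objects.bulk_create(objs)
except IntegrityError:
    # The bulk insert failed as a whole; fall back to one row at a time.
    for obj in objs:
        try:
            obj.save()
        except IntegrityError:
            continue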
If you have lots and lots of errors this may not be so efficient (you'll spend more time serially inserting than doing so in bulk), but I'm working through a high-cardinality dataset with few duplicates so this solves most of my problems.
(Note: I don't use Django, so there may be more suitable framework-specific answers)
It is not possible for Django to do this by simply ignoring INSERT failures because PostgreSQL aborts the whole transaction on the first error.
Django would need one of these approaches:
1. INSERT each row in a separate transaction and ignore errors (very slow);
2. Create a SAVEPOINT before each insert (can have scaling problems);
3. Use a procedure or query to insert only if the row doesn't already exist (complicated and slow); or
4. Bulk-insert or (better) COPY the data into a TEMPORARY table, then merge that into the main table server-side.
The upsert-like approach (3) seems like a good idea, but upsert and insert-if-not-exists are surprisingly complicated.
Personally, I'd take (4): I'd bulk-insert into a new separate table, probably UNLOGGED or TEMPORARY, then I'd run some manual SQL to:
LOCK TABLE realtable IN EXCLUSIVE MODE;
INSERT INTO realtable
SELECT * FROM temptable WHERE NOT EXISTS (
    SELECT 1 FROM realtable WHERE temptable.id = realtable.id
);
The LOCK TABLE ... IN EXCLUSIVE MODE prevents a concurrent insert from creating a row that conflicts with an insert done by the above statement and makes it fail. It does not prevent concurrent SELECTs, only SELECT ... FOR UPDATE, INSERT, UPDATE and DELETE, so reads from the table carry on as normal.
If you can't afford to block concurrent writes for too long you could instead use a writable CTE to copy ranges of rows from temptable into realtable, retrying each block if it failed.
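The staging-table merge of approach 4 can be tried out in a few lines. This sketch uses the standard-library sqlite3 module as a stand-in for PostgreSQL, so the LOCK TABLE step is left out (SQLite has no such statement); the table names follow the SQL above and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE realtable (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TEMPORARY TABLE temptable (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO realtable VALUES (?, ?)", [(1, "a"), (2, "b")])

# Stage the whole batch. The staging table has no unique constraint,
# so a row that would conflict with realtable is accepted here.
batch = [(2, "b-dup"), (3, "c"), (4, "d")]
conn.executemany("INSERT INTO temptable VALUES (?, ?)", batch)

# Merge only the rows whose id is not already present in realtable.
conn.execute(
    "INSERT INTO realtable (id, name) "
    "SELECT id, name FROM temptable "
    "WHERE NOT EXISTS (SELECT 1 FROM realtable WHERE realtable.id = temptable.id)"
)
conn.commit()

rows = conn.execute("SELECT id, name FROM realtable ORDER BY id").fetchall()
print(rows)
```

Duplicates inside the batch itself would still collide during the merge, so they need to be removed from the staging table first.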
Or 5. Divide and conquer
I didn't test or benchmark this thoroughly, but it performs pretty well for me. YMMV, depending in particular on how many errors you expect to get in a bulk operation.
def psql_copy(records):
    count = len(records)
    if count < 1:
        return True
    try:
        pg.copy_bin_values(records)
        return True
    except IntegrityError:
        if count == 1:
            # found culprit!
            msg = "Integrity error copying record:\n%r"
            logger.error(msg % (records[0],), exc_info=True)
            return False
    finally:
        connection.commit()
    # There was an integrity error but we had more than one record.
    # Divide and conquer.
    mid = count // 2  # integer division; count / 2 is a float on Python 3
    # Evaluate both halves so a failure in the first does not skip the second.
    first_ok = psql_copy(records[:mid])
    second_ok = psql_copy(records[mid:])
    return first_ok and second_ok
    # or just return False
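A database-independent sketch of the same bisection idea, which can be run without PostgreSQL. The names insert_skipping_failures and fake_insert are invented for this example; insert_batch stands for whatever all-or-nothing bulk operation is in use. The function always processes both halves and returns the records that could not be inserted.

```python
def insert_skipping_failures(records, insert_batch, error_type=Exception):
    """Insert records in bulk, bisecting failed batches to isolate bad rows.

    insert_batch(records) is a caller-supplied function that inserts the whole
    batch atomically or raises error_type. Returns the list of rejected records.
    """
    if not records:
        return []
    try:
        insert_batch(records)
        return []
    except error_type:
        if len(records) == 1:
            return list(records)  # found a culprit
    mid = len(records) // 2
    # Process both halves unconditionally so one bad row cannot block the rest.
    return (insert_skipping_failures(records[:mid], insert_batch, error_type)
            + insert_skipping_failures(records[mid:], insert_batch, error_type))


# Demonstration with an in-memory "table" that rejects duplicate keys.
table = set()


def fake_insert(batch):
    if len(set(batch)) != len(batch) or table & set(batch):
        raise ValueError("duplicate key")  # all-or-nothing, like one transaction
    table.update(batch)


rejected = insert_skipping_failures([1, 2, 3, 2, 4, 1], fake_insert, ValueError)
print(sorted(table), rejected)
```

With few bad rows the number of bulk calls stays close to one; with many bad rows it degrades towards one call per record, as noted above.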
Even in Django 1.11 there is no way to do this. I found a better option than using raw SQL: django-query-builder, which has an upsert method.
from querybuilder.query import Query
q = Query().from_table(YourModel)
# replace with your real objects
rows = [YourModel() for i in range(10)]
q.upsert(rows, ['unique_fld1', 'unique_fld2'], ['fld1_to_update', 'fld2_to_update'])
Note: The library only supports PostgreSQL.
Here is a gist that I use for bulk insert that supports ignoring IntegrityErrors and returns the records inserted.
Late answer for pre-Django 2.2 projects:
I ran into this situation recently and found my way out by keeping a second list to check for uniqueness.
In my case, the model has a unique_together constraint, and bulk_create was raising IntegrityError because the list passed to it contained duplicate entries.
So I decided to keep a checklist alongside the list of objects to create. Here is the sample code; the unique keys are owner and brand, and in this example owner is a user object instance and brand is a string:
create_list = []
create_list_check = []
for brand in brands:
    if (owner.id, brand) not in create_list_check:
        create_list_check.append((owner.id, brand))
        create_list.append(ProductBrand(owner=owner, name=brand))
if create_list:
    ProductBrand.objects.bulk_create(create_list)
It works for me.
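For large batches the membership check can be done against a set instead of a list, which keeps each lookup constant-time. A minimal sketch of that variant; unique_by_key is a helper name made up here, and the brand names and owner id are sample data.

```python
def unique_by_key(items, key):
    """Return items with duplicate keys removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


brands = ["Acme", "Globex", "Acme", "Initech", "Globex"]  # sample data
owner_id = 7  # hypothetical owner primary key
pairs = unique_by_key([(owner_id, b) for b in brands], key=lambda pair: pair)
print(pairs)
```

Like the list version, this only removes duplicates inside the batch; rows that already exist in the table would still raise IntegrityError.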
I am using this function in a thread.
My CSV file contains 120907 rows.
import os

import pandas as pd
from django.conf import settings

def products_create():
    full_path = os.path.join(settings.MEDIA_ROOT,'productcsv')
    filename = os.listdir(full_path)[0]
    logger.debug(filename)
    logger.debug(len(Product.objects.all()))
    if len(Product.objects.all()) > 0:
        logger.debug("Products Data Erasing")
        Product.objects.all().delete()
        logger.debug("Products Erasing Done")
    csvfile = os.path.join(full_path,filename)
    csv_df = pd.read_csv(csvfile,sep=',')
    csv_df['HSN Code'] = csv_df['HSN Code'].fillna(0)
    row_iter = csv_df.iterrows()
    logger.debug(row_iter)
    logger.debug("New Products Creating")
    for index, row in row_iter:
        Product.objects.create(part_number = row[0],
                               part_description = row[1],
                               mrp = row[2],
                               hsn_code = row[3],
                               gst = row[4],
                               )
    # products_list = [
    #     Product(
    #         part_number = row[0],
    #         part_description = row[1],
    #         mrp = row[2],
    #         hsn_code = row[3],
    #         gst = row[4],
    #     )
    #     for index, row in row_iter
    # ]
    # logger.debug(products_list)
    # Product.objects.bulk_create(products_list)
    logger.debug("Products uploading done")
When you pass a list of pk integers into add or remove - are the objects accessed? ie. Is there a database call for each pk?
When you create a ManyToManyField without specifying an intermediary table (using through), Django generates a table for you. This table only needs the pks of both models, so there's no need to select anything from the other objects in order to save the new relationships.
A new row will be created for each of them, possibly using many inserts, but not necessarily many database calls (a single SQL query with multiple insert commands, for instance, is possible). All the info needed for those creations (the pks of your objects) are readily available, so no need for any more database hits than necessary.
Update: seems I was mistaken. Looking at the sources (django/db/models/fields/related.py), I saw that it performs an independent creation for each object:
for obj_id in new_ids:
    self.through._default_manager.using(db).create(**{
        '%s_id' % source_field_name: self._pk_val,
        '%s_id' % target_field_name: obj_id,
    })
Before doing that, it also checks if any of the pks supplied already existed in the database (in order to avoid duplicate entries/uniqueness constraint violations):
vals = self.through._default_manager.using(db).values_list(target_field_name, flat=True)
vals = vals.filter(**{
    source_field_name: self._pk_val,
    '%s__in' % target_field_name: new_ids,
})
new_ids = new_ids - set(vals)
This check is done with a single query though...
Did you try to check that yourself using the QuerySet.query attribute?