Django ORM and locking table

My problem is as follows:
I have a car dealer A, and a db table named sold_cars. When a car is sold I create an entry in this table.
The table has an integer column named order_no. It should be unique within the cars sold by a dealer.
So if dealer A sold cars a, b and c, then this column should be 1, 2, 3. I have to use this column, and not a primary key, because I don't want any holes in my numbering - dealers A and B (which might be added later) should each have order numbers 1, 2, 3, and not A: 1, 3, 5 and B: 2, 4, 6. So... I select the greatest order_no for the given dealer, increment it by 1 and save.
The problem is that two people bought a car from dealer A in the same millisecond and both orders got the same order_no. Any advice? I was thinking of wrapping this process in a transaction block and locking the table until the transaction is complete, but I can't find any info on how to do that.

I know this question is a bit older, but I just had the same issue and wanted to share my learnings.
I wasn't quite satisfied with st0ne's answer, since (at least for postgres) a LOCK TABLE statement can only be issued within a transaction. And although in Django usually almost everything happens within a transaction, this LockingManager does not make sure that you actually are within a transaction, at least to my understanding. Also, I didn't want to completely replace the model's manager just to be able to lock it in one spot, so I was looking for something that works kind of like with transaction.atomic():, but also locks a given model.
So I came up with this:
from django.conf import settings
from django.db import DEFAULT_DB_ALIAS
from django.db.transaction import Atomic, get_connection


class LockedAtomicTransaction(Atomic):
    """
    Does an atomic transaction, but also locks the entire table for any transactions, for the duration of this
    transaction. Although this is the only way to avoid concurrency issues in certain situations, it should be
    used with caution, since it has an impact on performance, for obvious reasons...
    """
    def __init__(self, model, using=None, savepoint=None):
        if using is None:
            using = DEFAULT_DB_ALIAS
        super().__init__(using, savepoint)
        self.model = model

    def __enter__(self):
        super().__enter__()

        # Make sure not to lock when sqlite is used, or you'll run into problems while running tests!
        if settings.DATABASES[self.using]['ENGINE'] != 'django.db.backends.sqlite3':
            cursor = None
            try:
                cursor = get_connection(self.using).cursor()
                cursor.execute(
                    'LOCK TABLE {db_table_name}'.format(db_table_name=self.model._meta.db_table)
                )
            finally:
                if cursor and not cursor.closed:
                    cursor.close()
So if I now want to lock the model ModelToLock, this can be used like this:
with LockedAtomicTransaction(ModelToLock):
    # do whatever you want to do
    ModelToLock.objects.create()
EDIT: Note that I have only tested this using postgres. But to my understanding, it should also work on mysql just like that.
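Applied to the original question, a minimal sketch might look like this (SoldCar, its dealer and car fields and order_no are assumed from the question's description, not actual models from the thread):

from django.db.models import Max

def sell_car(dealer, car):
    # All writers on sold_cars serialize here, so the max is stable until commit.
    with LockedAtomicTransaction(SoldCar):
        last_no = (SoldCar.objects.filter(dealer=dealer)
                   .aggregate(Max('order_no'))['order_no__max']) or 0
        return SoldCar.objects.create(dealer=dealer, car=car, order_no=last_no + 1)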

from contextlib import contextmanager

from django.db import transaction
from django.db.transaction import get_connection


@contextmanager
def lock_table(model):
    with transaction.atomic():
        cursor = get_connection().cursor()
        cursor.execute(f'LOCK TABLE {model._meta.db_table}')
        try:
            yield
        finally:
            cursor.close()
This is very similar to @jdepoix's solution, but a bit more concise.
You can use it like this:
with lock_table(MyModel):
    MyModel.do_something()
Note that this only works with PostgreSQL and uses Python 3.6's f-strings, a.k.a. literal string interpolation.
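A caveat worth noting: in PostgreSQL a bare LOCK TABLE acquires the strictest mode, ACCESS EXCLUSIVE, which blocks even plain SELECTs. If you only need to serialize writers, you could ask for a weaker mode by changing the execute line inside lock_table, for example:

cursor.execute(f'LOCK TABLE {model._meta.db_table} IN SHARE ROW EXCLUSIVE MODE')

SHARE ROW EXCLUSIVE still allows concurrent reads, but only one transaction can hold it at a time, which is enough for the read-increment-save pattern in the question.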

I would recommend using the F() expression instead of locking the entire table. If your app is heavily used, locking the table will have a significant performance impact.
The exact scenario you described is mentioned in the Django documentation here. Based on your scenario, here's the code you can use:
from django.db.models import F

# Populate sold_cars as you normally do...

# Before saving, use the "F" expression
sold_cars.order_num = F('order_num') + 1
sold_cars.save()

# You must do this before referring to order_num:
sold_cars.refresh_from_db()

# Now you have the database-assigned order number in sold_cars.order_num
Note that if you set order_num during an update operation, use the following instead:
sold_cars.update(order_num=F('order_num') + 1)
sold_cars.refresh_from_db()
Since the database is in charge of updating the field, there won't be any race conditions or duplicate order_num values. Plus, this approach is much faster than the one with locked tables.
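As a later answer points out, F() on the sold_cars row itself doesn't compute a per-dealer maximum. A variant that does fit the question, sketched here under the assumption that the Dealer model carries a last_order_no integer field (not in the original models), lets the database increment a per-dealer counter and reads it back:

from django.db import transaction
from django.db.models import F

with transaction.atomic():
    # The increment runs in SQL, so concurrent sales serialize on the dealer
    # row and each transaction reads back a distinct value.
    Dealer.objects.filter(pk=dealer.pk).update(last_order_no=F('last_order_no') + 1)
    dealer.refresh_from_db(fields=['last_order_no'])
    sold_car.order_no = dealer.last_order_no
    sold_car.save()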

I think this code snippet meets your needs, assuming you are using MySQL. If not, you may need to tweak the syntax a little, but the idea should still work.
Source: Locking tables
import logging

from django.db import connection, models

logger = logging.getLogger(__name__)


class LockingManager(models.Manager):
    """ Add lock/unlock functionality to manager.

    Example::

        class Job(models.Model):

            objects = LockingManager()
            counter = models.IntegerField(null=True, default=0)

            @staticmethod
            def do_atomic_update(job_id):
                ''' Updates job integer, keeping it below 5 '''
                try:
                    # Ensure only one HTTP request can do this update at once.
                    Job.objects.lock()

                    job = Job.objects.get(id=job_id)
                    # If we don't lock the tables two simultaneous
                    # requests might both increase the counter
                    # going over 5
                    if job.counter < 5:
                        job.counter += 1
                        job.save()
                finally:
                    Job.objects.unlock()
    """

    def lock(self):
        """ Lock table.

        Locks the object model table so that atomic update is possible.
        Simultaneous database access requests pend until the lock is unlock()'ed.

        Note: If you need to lock multiple tables, you need to lock them
        all in one SQL clause and this function is not enough. To avoid
        deadlock, all tables must be locked in the same order.

        See https://dev.mysql.com/doc/refman/5.0/en/lock-tables.html
        """
        cursor = connection.cursor()
        table = self.model._meta.db_table
        logger.debug("Locking table %s" % table)
        cursor.execute("LOCK TABLES %s WRITE" % table)
        row = cursor.fetchone()
        return row

    def unlock(self):
        """ Unlock the table. """
        cursor = connection.cursor()
        table = self.model._meta.db_table
        cursor.execute("UNLOCK TABLES")
        row = cursor.fetchone()
        return row

I had the same problem. The F() solution solves a different problem: it doesn't get the max(order_no) across all sold_cars rows for a specific car dealer, but rather provides a way to update order_no based on the value already set in the field for a particular row.
Locking the entire table is overkill here; it's sufficient to lock only the specific dealer's rows.
Below is the solution I ended up with. The code assumes the sold_cars table references the dealers table via the sold_cars.dealer field. Imports, logging and error handling are omitted for clarity:
DEFAULT_ORDER_NO = 0

def save_sold_car(sold_car, dealer):
    # update the sold_car instance as you please

    with transaction.atomic():
        # To successfully use locks, the processes must query for row ranges
        # that intersect. If no common rows are present, no locks will be set.
        # We save the sold_car entry without an order_no to create at least one
        # row that can be locked. If order_no assignment fails later at some
        # point, the transaction will be rolled back and the 'incomplete'
        # sold_car entry will be removed.
        sold_car.save()

        # Each process adds its own sold_car entry. Concurrently getting
        # sold_cars by their dealer may result in row ranges which don't
        # intersect. For example, process A saves sold_car 'a1' the same moment
        # process B saves its 'b1' sold_car. Then both these processes get
        # sold_cars for the same dealer. Process A gets the single 'a1' row,
        # while process B gets the single 'b1' row.
        # Since all the sold_cars here belong to the same dealer, adding the
        # related dealer's row to each range with 'select_related' ensures
        # having at least one common row to acquire the lock on.
        dealer_sold_cars = (SoldCar.objects.select_related('dealer')
                                           .select_for_update()
                                           .filter(dealer=dealer))
        # Django queries are lazy; make sure to explicitly evaluate them
        # to acquire the locks.
        len(dealer_sold_cars)

        max_order_no = (dealer_sold_cars.aggregate(Max('order_no'))
                                        .get('order_no__max') or DEFAULT_ORDER_NO)
        sold_car.order_no = max_order_no + 1
        sold_car.save()
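A simpler variant of the same row-level idea, assuming there is a Dealer model you can query directly (it is only implied by the question): lock the parent dealer row itself with select_for_update(), so concurrent saves for that dealer serialize on one row that always exists:

from django.db import transaction
from django.db.models import Max

def save_sold_car_simple(sold_car, dealer):
    with transaction.atomic():
        # Blocks until any other transaction holding this dealer row commits.
        locked_dealer = Dealer.objects.select_for_update().get(pk=dealer.pk)
        max_order_no = (SoldCar.objects.filter(dealer=locked_dealer)
                        .aggregate(Max('order_no'))['order_no__max'] or 0)
        sold_car.order_no = max_order_no + 1
        sold_car.save()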

Related

How to create an auto increment field for a django model, with two keys which are unique_together?

I have a model like:

class Item(models.Model):
    site = Site()
    id_on_site = PositiveIntegerField()
Now I want to create an instance Item(current_site, next_id_on_site) with:
next_id_on_site = Item.objects.filter(site=current_site).aggregate(current_id=Max("id_on_site"))['current_id']+1
The problem is that the operation of generating the ID and creating the Item is not atomic, so there is a race condition which creates duplicate IDs, and .get(site=current_site, id_on_site=someid) will raise a MultipleObjectsReturned exception.
Using unique_together on the model does not help with generating the auto-increment ID, and doesn't seem to be enforced at the DB level at all.
unique_together is definitely implemented in the database, but since it generates a unique database index, you may need to run a migration to see its effects.
If all you want is for id_on_site to be some unique identifier for the item at the site, it might be easier to use something like a UUIDField(default=uuid.uuid4), which has a near-certain guarantee of uniqueness. If you need the ID to be an auto-incrementing integer, that's a bit harder.
One option to avoid the race condition would be to lock the row with the highest id_on_site value. (This will only work on certain database backends, e.g. postgres):
from django.db.transaction import atomic

with atomic():
    last_id_on_site = (
        Item.objects
        .filter(site=current_site)
        .select_for_update(nowait=False)
        .latest('id_on_site')
        .id_on_site)
    Item.objects.create(site=current_site, id_on_site=last_id_on_site + 1)
This should cause other transactions to block if they're trying to get the highest id_on_site, and once the transaction is committed, the item you just inserted will be returned to the other transaction. This could be problematic if the transaction is long-lived for some reason.
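One gap in the snippet above: latest() raises Item.DoesNotExist when the site has no items yet, so the very first insert needs a fallback. A sketch of the same transaction with that case handled:

from django.db.transaction import atomic

with atomic():
    try:
        last_id_on_site = (
            Item.objects
            .filter(site=current_site)
            .select_for_update(nowait=False)
            .latest('id_on_site')
            .id_on_site)
    except Item.DoesNotExist:
        # first item for this site
        last_id_on_site = 0
    Item.objects.create(site=current_site, id_on_site=last_id_on_site + 1)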

How to get the value of the counter of an AutoField?

How to get the value of the counter of an AutoField, such as the usual id field of most models?
At the moment, I do:
MyModel.objects.latest('id').id
But that does not work when all the objects have been deleted from the database.
Of course, a database-agnostic answer would be best.
EDIT
The accepted answer in Model next available primary key is not very relevant to my question, as I do not intend to use the counter value to create a new object. Also I don't mind if the value I get is not super accurate.
Background.
AFAIK there isn't a database-agnostic query. Different databases handle auto-increment differently, and there rarely is a use case for Django to find out what the next possible auto-increment ID is.
To elaborate further, in PostgreSQL you could do select nextval('my_sequence'), while in MySQL you would need to use last_insert_id(), but what this returns is the ID of the last insert, not the next one; these two may actually be very different! To get the actual value you would need to use SHOW TABLE STATUS.
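To illustrate the PostgreSQL side, here is a PostgreSQL-only sketch (the helper name is mine, and the result is only an approximation for exactly the reasons given above):

from django.db import connection

def approximate_next_id(model):
    """Peek at the sequence behind a model's AutoField (PostgreSQL only)."""
    table = model._meta.db_table
    pk_column = model._meta.pk.column
    with connection.cursor() as cursor:
        # pg_get_serial_sequence resolves the sequence that backs the column
        cursor.execute("SELECT pg_get_serial_sequence(%s, %s)", [table, pk_column])
        sequence = cursor.fetchone()[0]
        # Reading last_value does not advance the sequence, but it lags until
        # the first nextval() call and goes stale under concurrent inserts.
        cursor.execute("SELECT last_value FROM {}".format(sequence))
        return cursor.fetchone()[0] + 1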
Solution.
Create a record, save it, inspect its ID and delete it.
This will change the next id but you have indicated that you need only an approximation.
The alternative is to do a manual transaction with a rollback. This too would alter the next ID in the case of MySQL.
from django.db import IntegrityError, transaction

def find_next_val(mymodel):
    try:
        with transaction.atomic():
            # ...
            obj = mymodel.objects.create(....)
            print(obj.id)
            raise IntegrityError  # force a rollback so the object is not kept
    except IntegrityError:
        pass

How to 'bulk update' with Django?

I'd like to update a table with Django - something like this in raw SQL:
update tbl_name set name = 'foo' where name = 'bar'
My first result is something like this - but that's nasty, isn't it?
list = ModelClass.objects.filter(name='bar')
for obj in list:
    obj.name = 'foo'
    obj.save()
Is there a more elegant way?
Update:
Django 2.2 now has a bulk_update method.
Old answer:
Refer to the following Django documentation section:
Updating multiple objects at once
In short you should be able to use:
ModelClass.objects.filter(name='bar').update(name="foo")
You can also use F objects to do things like incrementing rows:
from django.db.models import F
Entry.objects.all().update(n_pingbacks=F('n_pingbacks') + 1)
See the documentation.
However, note that:
This won't use the ModelClass.save method (so if you have some logic inside it, it won't be triggered).
No Django signals will be emitted.
You can't perform an .update() on a sliced QuerySet; it must be on an original QuerySet, so you'll need to lean on the .filter() and .exclude() methods.
Consider using django-bulk-update found here on GitHub.
Install: pip install django-bulk-update
Implement: (code taken directly from the project's README file)
import random

from bulk_update.helper import bulk_update

random_names = ['Walter', 'The Dude', 'Donny', 'Jesus']
people = Person.objects.all()
for person in people:
    r = random.randrange(4)
    person.name = random_names[r]

bulk_update(people)  # updates all columns using the default db
Update: As Marc points out in the comments, this is not suitable for updating thousands of rows at once. It is, though, suitable for smaller batches of tens to hundreds. The size of the batch that is right for you depends on your CPU and query complexity. This tool is more like a wheelbarrow than a dump truck.
Django 2.2 now has a bulk_update method (release notes).
https://docs.djangoproject.com/en/stable/ref/models/querysets/#bulk-update
Example:
# get a pk: record dictionary of existing records
updates = YourModel.objects.filter(...).in_bulk()
....
# do something with the updates dict
....

if hasattr(YourModel.objects, 'bulk_update') and updates:
    # Use the new method
    YourModel.objects.bulk_update(updates.values(), [list the fields to update], batch_size=100)
else:
    # The old & slow way
    with transaction.atomic():
        for obj in updates.values():
            obj.save(update_fields=[list the fields to update])
If you want to set the same value on a collection of rows, you can use the update() method combined with any query term to update all rows in one query:
some_list = ModelClass.objects.filter(some condition).values('id')
ModelClass.objects.filter(pk__in=some_list).update(foo=bar)
If you want to update a collection of rows with different values depending on some condition, in the best case you can batch the updates according to values, as sketched after this paragraph. Let's say you have 1000 rows where you want to set a column to one of X values; then you could prepare the batches beforehand and run only X update queries (each essentially having the form of the first example above) plus the initial SELECT query.
If every row requires a unique value, there is no way to avoid one query per update. Perhaps look into other architectures like CQRS/Event sourcing if you need performance in this latter case.
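A minimal sketch of that batching idea, where computed_values stands in for whatever precomputed (pk, new_value) pairs your condition produces:

from collections import defaultdict

# computed_values is an assumed, precomputed list of (pk, new_value) pairs
batches = defaultdict(list)
for pk, new_value in computed_values:
    batches[new_value].append(pk)

# one UPDATE per distinct value instead of one per row
for value, pks in batches.items():
    ModelClass.objects.filter(pk__in=pks).update(foo=value)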
Here is some useful content I found on the internet regarding the above question:
https://www.sankalpjonna.com/learn-django/running-a-bulk-update-with-django
The inefficient way

model_qs = ModelClass.objects.filter(name='bar')
for obj in model_qs:
    obj.name = 'foo'
    obj.save()

The efficient way

ModelClass.objects.filter(name='bar').update(name='foo')  # for a single value 'foo', or add a loop

Using bulk_update

update_list = []
model_qs = ModelClass.objects.filter(name='bar')
for model_obj in model_qs:
    model_obj.name = 'foo'  # or whatever the value is; for simplicity I'm providing 'foo' only
    update_list.append(model_obj)
ModelClass.objects.bulk_update(update_list, ['name'])

Using an atomic transaction

from django.db import transaction

with transaction.atomic():
    model_qs = ModelClass.objects.filter(name='bar')
    for obj in model_qs:
        obj.name = 'foo'
        obj.save()
To update with the same value we can simply use this:

ModelClass.objects.filter(name='bar').update(name='foo')

To update with different values:

obj_list = ModelClass.objects.filter(name='bar')
obj_to_be_updated = []
for obj in obj_list:
    obj.name = "Dear " + obj.name
    obj_to_be_updated.append(obj)
ModelClass.objects.bulk_update(obj_to_be_updated, ['name'], batch_size=1000)

This won't trigger the save signal for every object; instead we keep all the objects to be updated in a list and issue a single bulk update.
update() returns the number of rows updated in the table:

update_counts = ModelClass.objects.filter(name='bar').update(name='foo')
You can refer to this link for more information on bulk update and create:
Bulk update and Create

Django bulk_create with ignore rows that cause IntegrityError?

I am using bulk_create to load thousands of rows into a PostgreSQL DB. Unfortunately some of the rows cause IntegrityError and stop the bulk_create process. I was wondering if there was a way to tell Django to ignore such rows and save as much of the batch as possible?
This is now possible on Django 2.2
Django 2.2 adds a new ignore_conflicts option to the bulk_create method, from the documentation:
On databases that support it (all except PostgreSQL < 9.5 and Oracle), setting the ignore_conflicts parameter to True tells the database to ignore failure to insert any rows that fail constraints such as duplicate unique values. Enabling this parameter disables setting the primary key on each model instance (if the database normally supports it).
Example:
Entry.objects.bulk_create([
    Entry(headline='This is a test'),
    Entry(headline='This is only a test'),
], ignore_conflicts=True)
One quick-and-dirty workaround for this that doesn't involve manual SQL and temporary tables is to just attempt to bulk insert the data. If it fails, revert to serial insertion.
objs = [Event(...), Event(...), Event(...), ...]
try:
    Event.objects.bulk_create(objs)
except IntegrityError:
    for obj in objs:
        try:
            obj.save()
        except IntegrityError:
            continue
If you have lots and lots of errors this may not be so efficient (you'll spend more time serially inserting than doing so in bulk), but I'm working through a high-cardinality dataset with few duplicates so this solves most of my problems.
(Note: I don't use Django, so there may be more suitable framework-specific answers)
It is not possible for Django to do this by simply ignoring INSERT failures because PostgreSQL aborts the whole transaction on the first error.
Django would need one of these approaches:
INSERT each row in a separate transaction and ignore errors (very slow);
Create a SAVEPOINT before each insert (can have scaling problems);
Use a procedure or query to insert only if the row doesn't already exist (complicated and slow); or
Bulk-insert or (better) COPY the data into a TEMPORARY table, then merge that into the main table server-side.
The upsert-like approach (3) seems like a good idea, but upsert and insert-if-not-exists are surprisingly complicated.
Personally, I'd take (4): I'd bulk-insert into a new separate table, probably UNLOGGED or TEMPORARY, then I'd run some manual SQL to:
LOCK TABLE realtable IN EXCLUSIVE MODE;

INSERT INTO realtable
SELECT * FROM temptable WHERE NOT EXISTS (
    SELECT 1 FROM realtable WHERE temptable.id = realtable.id
);
The LOCK TABLE ... IN EXCLUSIVE MODE prevents a concurrent insert from creating a row that conflicts with an insert done by the above statement and making it fail. It does not prevent concurrent SELECTs, only SELECT ... FOR UPDATE, INSERT, UPDATE and DELETE, so reads from the table carry on as normal.
If you can't afford to block concurrent writes for too long you could instead use a writable CTE to copy ranges of rows from temptable into realtable, retrying each block if it failed.
Or 5. Divide and conquer
I didn't test or benchmark this thoroughly, but it performs pretty well for me. YMMV, depending in particular on how many errors you expect to get in a bulk operation.
def psql_copy(records):
    count = len(records)
    if count < 1:
        return True
    try:
        pg.copy_bin_values(records)
        return True
    except IntegrityError:
        if count == 1:
            # found culprit!
            msg = "Integrity error copying record:\n%r"
            logger.error(msg % records[0], exc_info=True)
            return False
    finally:
        connection.commit()

    # There was an integrity error but we had more than one record.
    # Divide and conquer.
    mid = count // 2
    return psql_copy(records[:mid]) and psql_copy(records[mid:])
    # or just return False
Even in Django 1.11 there is no way to do this. I found a better option than using raw SQL: django-query-builder. It has an upsert method:
from querybuilder.query import Query

q = Query().from_table(YourModel)
# replace with your real objects
rows = [YourModel() for i in range(10)]
q.upsert(rows, ['unique_fld1', 'unique_fld2'], ['fld1_to_update', 'fld2_to_update'])
Note: the library only supports PostgreSQL.
Here is a gist that I use for bulk insert that supports ignoring IntegrityErrors and returns the records inserted.
A late answer for pre-Django 2.2 projects:
I ran into this situation recently and found my way out with a second list for checking uniqueness.
In my case, the model has a unique_together check, and bulk_create throws an IntegrityError because the list passed to bulk_create contains duplicate data.
So I decided to build a checklist alongside the list of objects to bulk-create. Here is the sample code; the unique keys are owner and brand, where owner is a user object instance and brand is a string:
create_list = []
create_list_check = []
for brand in brands:
    if (owner.id, brand) not in create_list_check:
        create_list_check.append((owner.id, brand))
        create_list.append(ProductBrand(owner=owner, name=brand))

if create_list:
    ProductBrand.objects.bulk_create(create_list)
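The same deduplication reads a little better with a set, which also makes the membership test O(1) instead of a linear scan; a small sketch of that variant:

seen = set()
create_list = []
for brand in brands:
    key = (owner.id, brand)
    if key not in seen:
        seen.add(key)
        create_list.append(ProductBrand(owner=owner, name=brand))

if create_list:
    ProductBrand.objects.bulk_create(create_list)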
This works for me. I use this function in a thread.
My CSV file contains 120,907 rows.
import logging
import os

import pandas as pd
from django.conf import settings

logger = logging.getLogger(__name__)


def products_create():
    full_path = os.path.join(settings.MEDIA_ROOT, 'productcsv')
    filename = os.listdir(full_path)[0]
    logger.debug(filename)

    logger.debug(len(Product.objects.all()))
    if len(Product.objects.all()) > 0:
        logger.debug("Products Data Erasing")
        Product.objects.all().delete()
        logger.debug("Products Erasing Done")

    csvfile = os.path.join(full_path, filename)
    csv_df = pd.read_csv(csvfile, sep=',')
    csv_df['HSN Code'] = csv_df['HSN Code'].fillna(0)
    row_iter = csv_df.iterrows()
    logger.debug(row_iter)

    logger.debug("New Products Creating")
    for index, row in row_iter:
        Product.objects.create(part_number=row[0],
                               part_description=row[1],
                               mrp=row[2],
                               hsn_code=row[3],
                               gst=row[4],
                               )

    # products_list = [
    #     Product(
    #         part_number=row[0],
    #         part_description=row[1],
    #         mrp=row[2],
    #         hsn_code=row[3],
    #         gst=row[4],
    #     )
    #     for index, row in row_iter
    # ]
    # logger.debug(products_list)
    # Product.objects.bulk_create(products_list)

    logger.debug("Products uploading done")

fast lookup for the last element in a Django QuerySet?

I have a model called Valor. Valor has a Robot. I'm querying like this:

Valor.objects.filter(robot=r).reverse()[0]

to get the last Valor for the r robot. Valor.objects.filter(robot=r).count() is about 200000, and getting the last item takes about 4 seconds on my PC.
How can I speed it up? Am I querying the wrong way?
The optimal MySQL syntax for this problem would be something along the lines of:
SELECT * FROM table WHERE x=y ORDER BY z DESC LIMIT 1
The django equivalent of this would be:
Valor.objects.filter(robot=r).order_by('-id')[:1][0]
Notice how this solution utilizes django's slicing method to limit the queryset before compiling the list of objects.
If none of the earlier suggestions are working, I'd suggest taking Django out of the equation and running this raw SQL against your database. I'm guessing at your table names, so you may have to adjust accordingly:
SELECT * FROM valor v WHERE v.robot_id = [robot_id] ORDER BY id DESC LIMIT 1;
Is that slow? If so, make your RDBMS (MySQL?) explain the query plan to you. This will tell you if it's doing any full table scans, which you obviously don't want with a table that large. You might also edit your question and include the schema for the valor table for us to see.
Also, you can see the SQL that Django is generating by doing this (using the query set provided by Peter Rowell):
qs = Valor.objects.filter(robot=r).order_by('-id')[:1]
print(qs.query)
Make sure that SQL is similar to the 'raw' query I posted above. You can also make your RDBMS explain that query plan to you.
It sounds like your data set is going to be big enough that you may want to denormalize things a little bit. Have you tried keeping track of the last Valor object in the Robot object?
class Robot(models.Model):
    # ...
    # on_delete is required on ForeignKey since Django 2.0
    last_valor = models.ForeignKey('Valor', null=True, blank=True, on_delete=models.SET_NULL)
And then use a post_save signal to make the update.
from django.db.models.signals import post_save

def record_last_valor(sender, **kwargs):
    if kwargs.get('created', False):
        instance = kwargs.get('instance')
        instance.robot.last_valor = instance
        instance.robot.save()  # persist the denormalized pointer

post_save.connect(record_last_valor, sender=Valor)
You will pay the cost of an extra db transaction when you create the Valor objects but the last_valor lookup will be blazing fast. Play with it and see if the tradeoff is worth it for your app.
Well, there's no order_by clause, so I'm wondering what you mean by 'last'. Assuming you meant 'last added',
Valor.objects.filter(robot=r).order_by('-id')[0]
might do the job for you.
Django 1.6 introduced .first() and .last():
https://docs.djangoproject.com/en/1.6/ref/models/querysets/#last
So you could simply do:
Valor.objects.filter(robot=r).last()
This should also be quite fast:

qs = Valor.objects.filter(robot=r)  # <-- doesn't hit the database
count = qs.count()  # <-- first database hit: compute the count
last_item = qs[count - 1]  # <-- second database hit: fetch the specified row
So, in practice you execute only 2 SQL queries ;)
Model_Name.objects.first()  # to get the first element
Model_Name.objects.last()   # to get the last

In my case last() did not work, because there was only one row in the database.
Maybe it will be helpful for you too :)
Is there a limit clause in Django? This way you can have the db simply return a single record.

MySQL:

select * from table where x = y limit 1

SQL Server:

select top 1 * from table where x = y

Oracle:

select * from table where x = y and rownum = 1

I realize this isn't translated into Django, but someone can come back and clean this up.
The correct way of doing this is to use the built-in QuerySet method latest() and feed it whichever column (field name) it should sort by. The drawback is that it can only sort by a single db column.
The current implementation looks like this and is optimized in the same sense as @Aaron's suggestion.
def latest(self, field_name=None):
    """
    Returns the latest object, according to the model's 'get_latest_by'
    option or the optional given field_name.
    """
    latest_by = field_name or self.model._meta.get_latest_by
    assert bool(latest_by), "latest() requires either a field_name parameter or 'get_latest_by' in the model"
    assert self.query.can_filter(), \
        "Cannot change a query once a slice has been taken."
    obj = self._clone()
    obj.query.set_limits(high=1)
    obj.query.clear_ordering()
    obj.query.add_ordering('-%s' % latest_by)
    return obj.get()