Celery/django duplicate key violations - django

I have a single Celery worker with 5 threads. It's scraping websites and saving domains to the DB via Django's ORM.
Here is roughly what it looks like:
domain_all = list(Domain.objects.all())
needs_domain = set()
for x in dup_free_scrape:
    domain = x['domain']
    if any(domain.lower() == s.name.lower() for s in domain_all):
        x['domainn'] = [o for o in domain_all if domain.lower() == o.name.lower()][0]
    else:
        print('adding: {}'.format(domain))
        needs_domain.add(domain)

create_domains = [Domain(name=b.lower()) for b in needs_domain]
create_domains_ids = Domain.objects.bulk_create(create_domains)
Probably not the best way, but it checks the domains in one dict (dup_free_scrape) against all domains already in the database.
It can go over hundreds or even thousands of domains before encountering the error, but sometimes it does:
Task keywords.domains.model_work[285c3e74-8e47-4925-9ab6-a99540a24665]
raised unexpected: IntegrityError('duplicate key value violates unique
constraint "keywords_domain_name_key"\nDETAIL: Key
(name)=(domain.com) already exists.\n',)
django.db.utils.IntegrityError: duplicate key value violates unique
constraint "keywords_domain_name_key"
The only reason for this issue I can think of would be: one thread saved a domain to the DB while another was in the middle of the code above?
I can't find any good solutions, but here is an idea (not sure if it's any good): wrap the whole thing in a transaction and, if the database raises the error, simply retry (query the database for Domain.objects.all() again).
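Roughly what I have in mind (just a sketch of the retry idea; the retry limit is arbitrary):
from django.db import IntegrityError, transaction

for attempt in range(3):  # arbitrary retry limit
    try:
        with transaction.atomic():
            domain_all = list(Domain.objects.all())
            # ... the matching/creation logic from above ...
            Domain.objects.bulk_create(create_domains)
        break
    except IntegrityError:
        # another thread inserted one of the domains meanwhile; re-read and retry
        continue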

If you are creating these records in bulk and multiple threads are at it, it's indeed very likely that the IntegrityErrors are caused by different threads inserting the same data. Do you really need multiple threads working on this? If yes, you could try:
create_domains = []
create_domain_ids = []
for x in dup_free_scrape:
    domain = x['domain']
    new_domain, created = Domain.objects.get_or_create(name=domain.lower())
    if created:
        create_domains.append(domain.lower())
        create_domain_ids.append(new_domain.pk)
Note that this is all the code; the select-all query you had right at the start is not needed. Domain.objects.all() is going to be very inefficient, because you are reading the entire table there.
Also note that your list comprehension over x['domain'] appears to be completely redundant.
The create_domains and create_domain_ids lists may not be needed unless you want to keep track of what was created.
Please make sure that you have the proper unique constraint on the domain name. From the get_or_create docs:
This method is atomic assuming correct usage, correct database configuration, and correct behavior of the underlying database. However, if uniqueness is not enforced at the database level for the kwargs used in a get_or_create call (see unique or unique_together), this method is prone to a race-condition which can result in multiple rows with the same parameters being inserted simultaneously.
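For illustration, a minimal sketch of enforcing that constraint at the model (and therefore database) level; the field name and length are assumptions:
from django.db import models

class Domain(models.Model):
    # unique=True makes the database reject duplicate names with an IntegrityError,
    # which is what makes concurrent get_or_create calls safe.
    name = models.CharField(max_length=253, unique=True)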

Related

django select_for_update(nowait=False) in transaction.atomic() does not work as expected

I have a Django app that needs to get a unique ID. Many threads that need one run at the same time. I would like the IDs to be sequential. When I need a unique ID I do this:
with transaction.atomic():
    max_batch_id = JobStatus.objects.select_for_update(nowait=False).aggregate(Max('batch_id'))
    json_dict['batch_id'] = max_batch_id['batch_id__max'] + 1
    status_row = JobStatus(**json_dict)
    status_row.save()
But multiple jobs are getting the same ID. Why does the code not work as I expect? What is a better way to accomplish what I need? I cannot use the row id as there are many rows that have the same batch_id.
As the Django docs say, you can use F() to avoid a race condition:
The process can be made robust, avoiding a race condition, as well as slightly faster by expressing the update relative to the original field value, rather than as an explicit assignment of a new value. Django provides F expressions for performing this kind of relative update.
So you probably want something like this:
...
max_batch_id = JobStatus.objects.select_for_update(nowait=False).aggregate(Max(F('batch_id')))
...
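For reference, this is the kind of relative update the documentation is talking about (a hypothetical Counter model, used purely for illustration):
from django.db.models import F

# The increment is computed by the database inside a single UPDATE statement,
# so two concurrent requests cannot both read the same old value and write it back.
Counter.objects.filter(name='batch_id').update(value=F('value') + 1)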

Does the Model.update method in Django lock the table before saving the instances?

I have a scenario in which I need to copy the values of one column into another column. I am trying to do:
Model.objects.select_related('vcsdata').all().update(charging_status_v2=F('charging_status'))
Would using an F expression along with update to copy the values create any downtime? Does it lock the table while performing the operation?
related_question_for_more_context
Short Answer:
No, it doesn't.
The only thing Django does in the update process (whether you use an F expression or not) is keep the previous state of your record(s), so that if something goes wrong it can roll back to that previous state.
def update(self, **kwargs):
    """
    Update all elements in the current QuerySet, setting all the given
    fields to the appropriate values.
    """
    self._not_support_combined_queries('update')
    assert not self.query.is_sliced, \
        "Cannot update a query once a slice has been taken."
    self._for_write = True
    query = self.query.chain(sql.UpdateQuery)
    query.add_update_values(kwargs)
    # Clear any annotations so that they won't be present in subqueries.
    query.annotations = {}
    with transaction.mark_for_rollback_on_error(using=self.db):
        rows = query.get_compiler(self.db).execute_sql(CURSOR)
    self._result_cache = None
    return rows
Basically, in the line with transaction.mark_for_rollback_on_error(using=self.db), it keeps the previous state of your record, but it does not lock your table or take any kind of partial lock.
For example, if you have two simultaneous updates (suppose one of them is going to take much longer than the other, and the slower one also hits your database before the faster one), then the faster one will hit your database regardless of the slower one and perform its operation. The slower one will then perform its own operation on your table (this example is enough to show that update does not lock your table).
Also note that calling update to change multiple objects (when that is applicable) is, as far as I know, the most efficient way to do it, compared to calling save() on each instance or using bulk_update().
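As a rough illustration of why a single update is more efficient (the model and field names follow the question; the loop version is shown only for comparison):
from django.db.models import F

# One UPDATE statement executed entirely in the database; no rows are loaded into Python.
Model.objects.update(charging_status_v2=F('charging_status'))

# Equivalent result, but much slower: every row is fetched, modified in Python,
# and written back with one query per instance.
for obj in Model.objects.all():
    obj.charging_status_v2 = obj.charging_status
    obj.save(update_fields=['charging_status_v2'])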

DRF - how to 'update' the pk after deleting an object?

A beginner here!
Here's how I'm using the URL paths (from the DRF tutorial):
path('articles/', views.ArticleList.as_view()),
path('articles/<int:pk>/', views.ArticleDetail.as_view())
and I noticed that after deleting an Article (this is my model), the pk stays the same.
An example:
1st Article pk = 1, 2nd Article pk = 2, 3rd Article pk = 3
After deleting the 2nd Article I'm expecting:
1st Article pk = 1, 3rd Article pk = 2
yet it remains
3rd Article pk = 3.
Is there a better way to implement the URL? Maybe the pk is not the variable I'm looking for, or should I update the list somehow?
Thanks
and I noticed that after deleting an Article (this is my model), the pk stays the same.
This is indeed the expected behaviour. Removing an object will not "fill the gap" by shifting all the other primary keys. That would mean that for huge tables you start updating thousands (if not millions) of records, resulting in a huge amount of disk IO and making the update (very) slow.
Furthermore, not only the primary keys of the table that stores the records would have to be updated, but also all the foreign keys that refer to these records. This means that several tables need to be updated, which results in even more disk IO and can slow down a lot of unrelated updates to the database due to locking.
This problem can be even more severe if you are working with a distributed system where you have multiple databases on different servers. Updating these databases atomically is a serious challenge. The CAP theorem [wiki] demonstrates that in case a network partition failure happens, then you either can not guarantee availability or consistency. By updating primary keys, you put more "pressure" on this.
Shifting the primary key is also not a good idea anyway. It would mean that if your REST API, for example, returns the primary key of an object, then a client that later wants to access that object might not be able to, because the primary key changed in the meantime. A primary key can thus be seen as a permanent identifier. It is usually not a good idea to change the token(s) that a client uses to access an object. If you use a primary key, or a slug, you know that if you later refer to the same item, you will retrieve the same item again.
how to 'update' the pk after deleting an object?
Please don't. Sorting elements can be done with a timestamp, but that is something different from having an identifier space without any gaps. A gap is often not a real problem, so better not to turn it into one.
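If ordering is what you are actually after, here is a minimal sketch of ordering by a creation timestamp instead of the pk (the created_at field is an assumption, not part of your model):
from django.db import models

class Article(models.Model):
    # Hypothetical field used only for ordering; the pk stays a stable identifier.
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['created_at']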

Do I need to commit transactions in Django 1.6?

I want to create an object, save it to the DB, then check if there is another row in the DB with the same token and execution_time=0. If there is, I want to delete the object I created and then restart the process.
transfer = Transfer(token=generateToken(size=9))
transfer.save()

while len(Transfer.objects.filter(token=transfer.token, execution_time=0)) != 1:
    transfer.delete()
    transfer = Transfer(token=generateToken(size=9))
    transfer.save()
Do I need to commit the transaction between iterations of the loop? For example, by calling commit() at the end of every iteration:
while len(Transfer.objects.filter(token=transfer.token, execution_time=0)) != 1:
    transfer.delete()
    transfer = Transfer(token=generateToken(size=9))
    transfer.save()
    commit()

@transaction.commit_manually
def commit():
    transaction.commit()
From what you've described I don't think you need to use transactions. You're basically recreating a transaction rollback manually with your code.
I think the best way to handle this would be to have a database constraint enforce the issue. Is it the case that token and execution_time should be unique together? In that case you can define the constraint in Django with unique_together. If the constraint is that token should be unique whenever execution_time is 0, some databases will let you define a constraint like that as well.
If the constraint were in the database you could just do a get_or_create() in a loop until the Transfer was created.
If you can't define the constraint in the database for whatever reason then I think your version would work. (One improvement would be to use .count() instead of len.)
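A rough sketch of that loop, assuming a unique constraint (e.g. unique_together) on token and execution_time:
created = False
while not created:
    # With the constraint enforced in the database, a colliding token simply
    # returns the existing row (created=False) and a new token is drawn.
    transfer, created = Transfer.objects.get_or_create(
        token=generateToken(size=9), execution_time=0
    )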
I want to create an object, save it to the DB, then check if there is another row in the DB with the same token and execution_time=0. If there is, I want to delete the object I created and then restart the process.
There are a few ways you can approach this, depending on what your end goal is:
Do you want to ensure that no other record is written while you are writing yours (to prevent duplicates)? If so, you need to get a lock on your table, and to do that you need to execute an atomic transaction with @transaction.atomic (new in 1.6).
If you want to make sure that no duplicate records are created for a given combination of fields, you need to enforce this at the database level with unique_together.
I believe combining the above two will solve your problem; however, if you want a more brute-force approach, you can override the save() method of your object and raise an appropriate exception when a record that violates your constraints is about to be created (or updated).
In your view, you would then catch this exception and take the appropriate action.
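A minimal sketch of declaring the unique_together constraint mentioned above (the field types are assumptions; adjust to your actual schema):
from django.db import models

class Transfer(models.Model):
    token = models.CharField(max_length=9)
    execution_time = models.IntegerField(default=0)

    class Meta:
        # Let the database reject duplicates instead of checking in Python.
        unique_together = ('token', 'execution_time')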

Is get_or_create() thread safe?

I have a Django model that can only be accessed using get_or_create(session=session), where session is a foreign key to another Django model.
Since I am only accessing it through get_or_create(), I would imagine that I would only ever have one instance with a key to a given session. However, I have found multiple instances with keys to the same session. What is happening? Is this a race condition, or does get_or_create() operate atomically?
NO, get_or_create is not atomic.
It first asks the DB if a satisfying row exists; the database returns, Python checks the result; if it doesn't exist, it creates one. Between the get and the create, anything can happen, and a row matching the get criteria can be created by some other code.
For instance, with regard to your specific issue: if the user has two pages open (or several AJAX requests are performed) at the same time, this might cause all the gets to fail and all of them to create a new row, with the same session.
It is thus important to only use get_or_create when the duplication issue will be caught by the database through some unique/unique_together, so that even though multiple threads can get to the point of save(), only one will succeed, and the others will raise an IntegrityError that you can catch and deal with.
If you use get_or_create with (a set of) fields that are not unique in the database you will create duplicates in your database, which is rarely what you want.
More generally: do not rely on your application to enforce uniqueness and avoid duplicates in your database! That's the database's job!
(Well, unless you wrap your critical functions with some OS-level locks, but I would still suggest using the database.)
With these warnings in mind, used correctly get_or_create is an easy-to-read, easy-to-write construct that perfectly complements the database's integrity checks.
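A minimal sketch of the pattern described above (the model name is illustrative; it assumes session is unique at the database level):
from django.db import IntegrityError

try:
    obj, created = MyModel.objects.get_or_create(session=session)
except IntegrityError:
    # Another thread created the row between our get and our create; fetch it instead.
    obj = MyModel.objects.get(session=session)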
Refs and citations:
http://groups.google.com/group/django-developers/browse_thread/thread/905f79e350525c95/0af3a41de4f4ce06
http://groups.google.com/group/django-developers/browse_thread/thread/f0b3381b2620e7db/8eae2f6087e550bb
Actually it's not thread-safe; you can look at the code of the get_or_create method of the QuerySet object. Basically, what it does is the following:
try:
    return self.get(**lookup), False
except self.model.DoesNotExist:
    params = dict([(k, v) for k, v in kwargs.items() if '__' not in k])
    params.update(defaults)
    obj = self.model(**params)
    sid = transaction.savepoint(using=self.db)
    obj.save(force_insert=True, using=self.db)
    transaction.savepoint_commit(sid, using=self.db)
    return obj, True
So two threads might each figure out that the instance does not exist in the DB and start creating a new one, before saving them consecutively.
Threading is one problem, but get_or_create is also broken for any serious usage at the default isolation level of MySQL:
How do I deal with this race condition in django?
Why doesn't this loop display an updated object count every five seconds?
https://code.djangoproject.com/ticket/13906
http://www.no-ack.org/2010/07/mysql-transactions-and-django.html
I was having this problem with a view that calls get_or_create.
I was using Gunicorn with multiple workers, so to test it I changed the number of workers to 1, and this made the problem disappear.
The simplest solution I found was to lock the table for access. I used this decorator to do the lock per view (for PostgreSQL):
http://www.caktusgroup.com/blog/2009/05/26/explicit-table-locking-with-postgresql-and-django/
EDIT: I wrapped the lock statement in that decorator in a try/except to deal with DB engines that don't support it (SQLite while unit testing, in my case):
try:
    cursor.execute('LOCK TABLE %s IN %s MODE' % (model._meta.db_table, lock))
except DatabaseError:
    pass
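For context, a hedged sketch of the same idea done inline around a get_or_create call, without the decorator (model, field, and lock mode are illustrative; PostgreSQL syntax):
from django.db import DatabaseError, connection, transaction

with transaction.atomic():
    with connection.cursor() as cursor:
        try:
            # Explicit table lock, held until the end of the atomic block.
            cursor.execute('LOCK TABLE %s IN SHARE ROW EXCLUSIVE MODE'
                           % MyModel._meta.db_table)
        except DatabaseError:
            pass  # backend (e.g. SQLite) does not support LOCK TABLE
    obj, created = MyModel.objects.get_or_create(session=session)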