Occasional IntegrityError on m2m fields using PostgreSQL - django

I can't detect any pattern; maybe 1 in every 1000 edits of a certain model returns an IntegrityError on an m2m field. Most of the time this field wasn't even modified. When a model is saved, I believe Django always wipes the m2m field and then re-adds the items, right? I saw Django calls clear() and then add()s the items.
My code then fails with:
IntegrityError: duplicate key value violates unique constraint
"app_model_m2m_field_key" DETAIL: Key (model1_id, model2_id)=(597,
1009) already exists.
It seems like the add of items is performed before the items are cleared, which is very weird. I've tried to reproduce it, but it's very hard; it only happens occasionally. Any idea what could cause it? Could setting autocommit maybe solve this problem?
Thanks in advance

Most likely, you have two requests racing to commit similar changes at the same time.
1. Request 1 begins a transaction and DELETEs the existing M2M rows.
2. Request 2 begins a transaction and DELETEs the M2M rows with the same WHERE clause. This blocks waiting for request 1's transaction to commit.
3. Request 1 re-INSERTs all the M2M rows and commits.
4. Request 2 resumes, and its DELETE succeeds without deleting any rows, because all rows that existed when the statement began have already been deleted.
5. Request 2 tries to re-INSERT an M2M row, but the database detects that it already exists and returns an error.
It's possible to fix this by upgrading to the SERIALIZABLE isolation level (instead of PostgreSQL's default of READ COMMITTED) but at the cost of even more exciting potential failure modes and worse performance.
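If you do want to experiment with that, Django lets you pass an isolation level to the driver through the database OPTIONS. A minimal sketch, assuming psycopg2 (the database name is a placeholder):

import psycopg2.extensions

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mydb',
        'OPTIONS': {
            # every transaction runs SERIALIZABLE; be prepared to catch
            # serialization failures and retry the whole transaction
            'isolation_level': psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE,
        },
    }
}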
I'm assuming you're right that Django is performing a DELETE followed by a series of INSERTs, although that wouldn't be a very good plan precisely because it exacerbates this kind of race.
The best plan is to identify what has actually changed and only ask the database to make those changes, because then if you get an integrity error it's because there was a real conflict that you probably couldn't do anything about anyway.
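As a sketch of that idea (obj.tags stands in for whatever the real M2M field is, and new_pks for the incoming set of ids):

current_ids = set(obj.tags.values_list('pk', flat=True))
new_ids = set(new_pks)

obj.tags.remove(*(current_ids - new_ids))  # DELETE only the rows that actually went away
obj.tags.add(*(new_ids - current_ids))     # INSERT only the rows that are actually new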

Related

Does transaction.atomic roll back increments to a pk sequence

I'm using Django 2.2 and my question is: does transaction.atomic roll back increments to a pk sequence?
Below is the background bug I wrote up that led me to this issue
I'm facing a really weird issue that I can't figure out and I'm hoping someone has faced a similar issue.
An insert using the django ORM .create() function is returning django.db.utils.IntegrityError: duplicate key value violates unique constraint "my_table_pkey" DETAIL: Key (id)=(5795) already exists.
Fine. But then I look at the table and no record with id=5795 exists!
SELECT * from my_table where id=5795;
shows (0 rows)
A look at the sequence my_table_id_seq shows that it has nonetheless incremented to last_value = 5795, as if the above record was inserted. Moreover, the issue does not always occur; a successful insert with different data goes in at id=5796. (I tried resetting the pk sequence, but that didn't do anything, since it doesn't seem to be the problem anyway.)
I'm quite stumped by this and it has caused us a lot of issues on one specific table. Finally I realized the call is wrapped in transaction.atomic and that a particular scenario may be causing a double insert with the same pk.
So my theory is: transaction.atomic is not rolling back the increment of the sequence.
Postgres sequences do not roll back. Every time they are touched by a statement they advance, whether the statement succeeds or not. For more information see the Notes section of the PostgreSQL CREATE SEQUENCE documentation.
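A quick way to see this, as a sketch (MyModel and its name field are placeholders for any model with an auto pk):

from django.db import transaction
from myapp.models import MyModel  # placeholder app/model

try:
    with transaction.atomic():
        MyModel.objects.create(name="first")    # consumes the next sequence value
        raise RuntimeError("force a rollback")  # the INSERT is rolled back...
except RuntimeError:
    pass

obj = MyModel.objects.create(name="second")
# ...but the sequence advance is not, so obj.pk skips over the id
# that the rolled-back insert consumed.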

DRF - how to 'update' the pk after deleting an object?

A beginner here!
Here's how I'm using the url path (from the DRF tutorials):
path('articles/', views.ArticleList.as_view()),
path('articles/<int:pk>/', views.ArticleDetail.as_view())
and I noticed that after deleting an 'Article' (this is my model), the pk stays the same.
An example:
1st Article pk = 1, 2nd Article pk = 2, 3rd Article pk = 3
After deleting the 2nd Article I'm expecting --
1st Article pk = 1, 3rd Article pk = 2
yet it remains
3rd Article pk = 3.
Is there a better way to implement the url? Maybe the pk is not the variable I'm looking for?
Or should I update the list somehow?
Thanks!
and I noticed that after deleting an Article (this is my model), the pk stays the same.
This is indeed the expected behaviour. Removing an object will not "fill the gap" by shifting all the other primary keys. That would mean that for huge tables you would start updating thousands (if not millions) of records, resulting in a huge amount of disk IO and making the update (very) slow.
Furthermore, not only would the primary keys of the table that stores the records have to be updated, but also all the foreign keys that refer to these records. This means several tables need to be updated, which results in even more disk IO, and the locking involved can slow down a lot of unrelated updates to the database.
The problem is even more severe if you are working with a distributed system, where you have multiple databases on different servers. Updating these databases atomically is a serious challenge: the CAP theorem demonstrates that when a network partition happens, you cannot guarantee both availability and consistency. Rewriting primary keys puts more "pressure" on this.
Shifting primary keys is not a good idea anyway. If your REST API, for example, returns the primary key of an object, a client that later wants to access that object might not be able to, because the primary key changed in the meantime. A primary key can thus be seen as a permanent identifier, and it is usually not a good idea to change the token(s) a client uses to access an object. If you use a primary key or a slug, you know that when you later refer to the same item, you will retrieve the same item again.
how to 'update' the pk after deleting an object?
Please don't. Sorting elements can be done with a timestamp, but that is something different from having an identifier space without gaps. A gap is often not a real problem, so better not to turn it into one.
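If the goal is just a stable ordering, sort by a timestamp instead of renumbering pks; a sketch (the field names here are illustrative, not from the poster's model):

from django.db import models

class Article(models.Model):
    title = models.CharField(max_length=200)
    created = models.DateTimeField(auto_now_add=True)  # set once, on insert

    class Meta:
        ordering = ['created']  # list endpoints return articles in creation order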

Oracle FK constraints are not enforced

I'm using an Oracle 12c database and want to test out one problem.
When carrying out a web service request, it returns an underlying ORA-02292 error with the constraint name (YYY.FK_L_TILSYNSOBJEKT_BEGRENSNING).
Here is SQL of the table with the constraint:
CONSTRAINT "FK_L_TILSYNSOBJEKT_BEGRENSNING" FOREIGN KEY ("BEGRENSNING")
REFERENCES "XXX"."BEGRENSNING" ("IDSTRING") DEFERRABLE INITIALLY DEFERRED ENABLE NOVALIDATE
The problem is that when I try to delete a row with a valid IDSTRING (present in both tables) from the parent table manually, it succeeds.
What causes it to behave this way? Is there any other info I should give?
Not sure if it helps someone, since it was a fairly stupid mistake, but I'll try to make it useful since people demand answers.
The keywords DEFERRABLE INITIALLY DEFERRED mean that the constraint is enforced at commit time, not at query run time. This is opposed to INITIALLY IMMEDIATE, which does the check right after you issue each query. INITIALLY IMMEDIATE makes bulk updates a bit slower, since every statement in the transaction has to be checked against the constraint; with initial deferral the check happens once at commit, and if it turns out there is an issue the whole bulk is rolled back, without any unnecessary per-statement checks.
Error ORA-02292, however, is raised only for DELETE statements; knowing that makes it a bit easier to debug your statements.
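To illustrate where the error surfaces with a deferred constraint, a sketch using python-oracledb (an assumption; the connection details and bind value are placeholders):

import oracledb

conn = oracledb.connect(user="xxx", password="xxx", dsn="localhost/orclpdb1")
cur = conn.cursor()

# The DELETE itself succeeds: the INITIALLY DEFERRED constraint is not checked yet.
cur.execute("DELETE FROM begrensning WHERE idstring = :1", ["some-id"])

# The check runs here; if child rows still reference the parent,
# ORA-02292 is raised by the commit, not by the DELETE above.
conn.commit()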

Celery/django duplicate key violations

I have a single Celery worker with 5 threads. It's scraping websites and saving domains to the DB via Django's ORM.
Here is roughly what it looks like:
domain_all = list(Domain.objects.all())
needs_domain = set()
for x in dup_free_scrape:
    domain = x['domain']
    if any(domain.lower() == s.name.lower() for s in domain_all):
        x['domainn'] = [o for o in domain_all if domain.lower() == o.name.lower()][0]
    else:
        print('adding: {}'.format(domain))
        needs_domain.add(domain)
create_domains = [Domain(name=b.lower()) for b in needs_domain]
create_domains_ids = Domain.objects.bulk_create(create_domains)
Probably not the best way, but it checks the domains in one dict (dup_free_scrape) against all domains already in the database.
It can get through hundreds or even thousands of domains before encountering the error, but sometimes it does:
Task keywords.domains.model_work[285c3e74-8e47-4925-9ab6-a99540a24665]
raised unexpected: IntegrityError('duplicate key value violates unique
constraint "keywords_domain_name_key"\nDETAIL: Key
(name)=(domain.com) already exists.\n',)
django.db.utils.IntegrityError: duplicate key value violates unique
constraint "keywords_domain_name_key"
The only reason for this issue I can think of would be: one thread saved a domain to the DB while another was in the middle of the code above?
I can't find any good solutions, but here is an idea (not sure if it's any good): wrap the whole thing in a transaction, and if the database raises an error simply retry (query the database for Domain.objects.all() again).
If you are creating these records in bulk and multiple threads are at it, it's indeed very likely that the IntegrityErrors are caused by different threads inserting the same data. Do you really need multiple threads working on this? If so, you could try:
create_domains = []
create_domain_ids = []
for x in dup_free_scrape:
    domain = x['domain']
    new_domain, created = Domain.objects.get_or_create(name=domain.lower())
    if created:
        create_domains.append(domain.lower())
        create_domain_ids.append(new_domain.pk)
Note that this is all the code; the select-all query you had right at the start is not needed. Domain.objects.all() is going to be very inefficient, because you are reading the entire table there.
Also note that your list comprehension for x['domain'] appeared to be completely redundant.
The create_domains and create_domain_ids lists may not be needed unless you want to keep track of what was created.
Please make sure that you have a proper unique constraint on the domain name. From the get_or_create docs:
This method is atomic assuming correct usage, correct database
configuration, and correct behavior of the underlying database.
However, if uniqueness is not enforced at the database level for the
kwargs used in a get_or_create call (see unique or unique_together),
this method is prone to a race-condition which can result in multiple
rows with the same parameters being inserted simultaneously.
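If a rare race does slip through despite the unique constraint, the usual pattern is to catch the IntegrityError and fetch the row the other thread just inserted. A sketch (the atomic block keeps the failed insert from poisoning an outer transaction on PostgreSQL):

from django.db import IntegrityError, transaction

def get_or_create_domain(name):
    # get_or_create can still race when two threads insert the same
    # name simultaneously; the loser gets an IntegrityError, at which
    # point the row exists and can simply be fetched.
    try:
        with transaction.atomic():
            return Domain.objects.get_or_create(name=name.lower())
    except IntegrityError:
        return Domain.objects.get(name=name.lower()), False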

Adding data to database in django issue

I have code like this:
form = TestForm(request.POST)
form.save(commit=False).save()
This code sometimes works, sometimes doesn't. The problem is in the auto-increment id.
When I have some data in the db that was not written by Django and I want to add data from Django, I get an IntegrityError: id already exists.
If I have 2 rows in the db (not added by Django), I need to click "add data" 3 times. Only after the third time, when the id has incremented to 3, is everything OK.
How to solve this?
These integrity errors appear when your table's sequence was not updated as new items were created, i.e. the sequence is out of sync with reality. For example, you import items from some source and the items also contain ids higher than your table's sequence indicates. I have not seen a case where Django messes sequences up.
So what I guess happens is that the other source that inserts data into your database also inserts ids, and the sequence is not updated. Fix that and your problems should disappear.
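Django can generate the statements that put the sequences back in sync. A sketch of running them ('myapp' is a placeholder app label):

from io import StringIO
from django.core.management import call_command
from django.db import connection

out = StringIO()
call_command('sqlsequencereset', 'myapp', stdout=out)  # emits SQL, does not run it
with connection.cursor() as cursor:
    cursor.execute(out.getvalue())  # apply the setval(...) statements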