Adding data to database in django issue - django

I have code like this:
form = TestForm(request.POST)
form.save(commit=False).save()
This code sometimes works and sometimes doesn't. The problem is the auto-increment id.
When the db contains rows that were not written by Django and I try to add data from Django, I get IntegrityError: id already exists.
If I have 2 rows in the db (not added by Django), I need to click "add data" 3 times. On the third attempt, when the id has incremented to 3, everything is OK.
How can I solve this?

These integrity errors appear when the table's sequence is not updated after a new item is created, or when the sequence is otherwise out of sync with the data. For example, you import items from some source, the items carry their own ids, and those ids are higher than the table's sequence indicates. I have not seen a case where Django messes sequences up.
So my guess is that the other source that inserts data into your database also inserts ids, and the sequence is not updated. Fix that and your problems should disappear.
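To make the "click three times" behaviour concrete, here is a toy model in plain Python. It is not PostgreSQL and not Django; ToyTable and insert_until_success are hypothetical names invented for this sketch. It only assumes what the answer describes: external inserts write their own ids without touching the sequence, and each ORM-style insert consumes the next sequence value even when it fails.

```python
import itertools


class ToyTable:
    """Toy stand-in for a table with a serial primary key (not real PostgreSQL)."""

    def __init__(self):
        self.rows = {}                      # id -> data
        self.sequence = itertools.count(1)  # hands out 1, 2, 3, ...

    def insert_with_explicit_id(self, row_id, data):
        # An external tool writes its own id and never touches the sequence.
        self.rows[row_id] = data

    def insert(self, data):
        # An ORM-style insert takes the next sequence value as the id.
        # The value is consumed even when the insert fails.
        row_id = next(self.sequence)
        if row_id in self.rows:
            raise KeyError(f"duplicate key: id {row_id} already exists")
        self.rows[row_id] = data
        return row_id


def insert_until_success(table, data):
    """Retry the insert and report how many attempts it took."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return table.insert(data), attempts
        except KeyError:
            continue


table = ToyTable()
table.insert_with_explicit_id(1, "imported row")
table.insert_with_explicit_id(2, "imported row")
new_id, attempts = insert_until_success(table, "row added from Django")
```

With two imported rows, the first two attempts collide with ids 1 and 2 and the third succeeds with id 3, which matches what the question reports.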

Related

Does transaction.atomic roll back increments to a pk sequence

I'm using Django 2.2 and my question is: does transaction.atomic roll back increments to a pk sequence?
Below is the background bug I wrote up that led me to this issue
I'm facing a really weird issue that I can't figure out and I'm hoping someone has faced a similar issue.
An insert using the django ORM .create() function is returning django.db.utils.IntegrityError: duplicate key value violates unique constraint "my_table_pkey" DETAIL: Key (id)=(5795) already exists.
Fine. But then I look at the table and no record with id=5795 exists!
SELECT * from my_table where id=5795;
shows (0 rows)
A look at the sequence my_table_id_seq shows that it has nonetheless incremented to show last_value = 5795, as if the above record had been inserted. Moreover, the issue does not always occur. A successful insert with different data is inserted at id=5796. (I tried resetting the pk sequence, but that didn't do anything, since it doesn't seem to be the problem anyway.)
I'm quite stumped by this and it has caused us a lot of issues on one specific table. Finally I realized the call is wrapped in transaction.atomic and that a particular scenario may be causing a double insert with the same pk.
So my theory is: The transaction atomic is not rolling back the increment of the
Postgres sequences do not roll back. Every time a statement calls nextval on one, it advances, whether the statement succeeds or not. For more information, see the Notes section of the PostgreSQL CREATE SEQUENCE documentation.

Performance optimization on Django update or create

In a Django project, I'm refreshing tens of thousands of lines of data from an external API on a daily basis. The problem is that since I don't know if the data is new or just an update, I can't do a bulk_create operation.
Note: Some, or perhaps many, of the rows do not actually change on a daily basis, but I don't know which, or how many, ahead of time.
So for now I do:
for row in csv_data:
    try:
        MyModel.objects.update_or_create(
            id=row['id'],
            defaults={'field1': row['value1']},  # .... further fields elided in the original post
        )
    except Exception as exc:
        print('error!', exc)
And it takes.... forever! One or two lines a second, max speed, sometimes several seconds per line. Each model I'm refreshing has one or more other models connected to it through a foreign key, so I can't just delete them all and reinsert every day. I can't wrap my head around this one -- how can I cut down significantly the number of database operations so the refresh doesn't take hours and hours.
Thanks for any help.
The problem is that you are doing a database action for each data row you grabbed from the API. You can avoid that by working out which rows are new (and doing one bulk insert for all of them), which rows actually need an update, and which didn't change.
To elaborate:
Grab all the relevant rows from the database (meaning all the rows that can possibly be updated):
old_data = MyModel.objects.all()  # if possible, narrow this with MyModel.objects.filter(...)
Grab all the API data you need to insert or update:
api_data = [...]
For each row of data, work out whether it is new and put it in an array, or determine whether the row needs to update the DB:
new_rows_array = []
for row in api_data:
    if is_new_row(row, old_data):
        new_rows_array.append(row)
    elif is_data_modified(row, old_data):
        ...  # do the update
    # unchanged rows need no database work
# bulk_create expects model instances; this assumes the row keys match the model's field names
MyModel.objects.bulk_create([MyModel(**row) for row in new_rows_array])
is_new_row decides whether the row is new; new rows are collected in an array that is bulk created at the end.
is_data_modified looks the row up in the old data and reports whether its data has changed, so that only changed rows are updated.
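One way the two helpers above could be folded into a single pass is sketched below in plain Python. partition_rows is a hypothetical name, and the sketch assumes that both the API rows and the existing rows are dicts carrying an "id" key, with the existing rows indexed by id beforehand so that each lookup is a dictionary access rather than a query.

```python
def partition_rows(api_rows, existing_by_id, fields):
    """Split incoming rows into new, changed and unchanged.

    api_rows:       list of dicts from the API, each with an "id" key
    existing_by_id: dict mapping id -> dict of the values currently stored
    fields:         names of the fields to compare
    """
    new_rows, changed_rows, unchanged_ids = [], [], []
    for row in api_rows:
        current = existing_by_id.get(row["id"])
        if current is None:
            new_rows.append(row)              # not in the database yet
        elif any(current[f] != row[f] for f in fields):
            changed_rows.append(row)          # at least one field differs
        else:
            unchanged_ids.append(row["id"])   # nothing to write
    return new_rows, changed_rows, unchanged_ids


existing_by_id = {
    1: {"id": 1, "field1": "a"},
    2: {"id": 2, "field1": "b"},
}
api_rows = [
    {"id": 1, "field1": "a"},   # unchanged
    {"id": 2, "field1": "B"},   # changed
    {"id": 3, "field1": "c"},   # new
]
new_rows, changed_rows, unchanged_ids = partition_rows(
    api_rows, existing_by_id, ["field1"]
)
```

In a Django project, existing_by_id could probably be built with something like {r["id"]: r for r in MyModel.objects.values("id", "field1")}. The new rows would then go to bulk_create, and the changed rows to individual updates or, on Django 2.2 and later, to bulk_update.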
If you look at the source code for update_or_create(), you'll see that it's hitting the database multiple times for each call (either a get() followed by a save(), or a get() followed by a create()). It does things this way to maximize internal consistency - for example, this ensures that your model's save() method is called in either case.
But you might well be able to do better, depending on your specific models and the nature of your data. For example, if you don't have a custom save() method, aren't relying on signals, and know that most of your incoming data maps to existing rows, you could instead try an update() followed by a bulk_create() if the row doesn't exist. Leaving aside related models, that would result in one query in most cases, and two queries at the most. Something like:
updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    MyModel.objects.bulk_create([MyModel(field1="stuff", field2="other")])
(Note that this simplified example has a race condition, see the Django source for how to deal with it.)
In the future there will probably be support for PostgreSQL's UPSERT functionality, but of course that won't help you now.
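For readers who have not seen an upsert, the idea can be tried out with the sqlite3 module from the standard library. This is only a stand-in: it uses SQLite rather than PostgreSQL, it needs SQLite 3.24 or newer for the ON CONFLICT clause, and the my_model table with its field1 column is made up for the sketch. Later Django releases (4.1 onward) expose the same idea through the update_conflicts option of bulk_create.

```python
import sqlite3

# In-memory database with one existing row (table and column names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_model (id INTEGER PRIMARY KEY, field1 TEXT)")
conn.execute("INSERT INTO my_model (id, field1) VALUES (1, 'old')")

# One row that already exists (id 1) and one that does not (id 2).
incoming = [(1, "updated"), (2, "brand new")]

# Insert each row; when the id already exists, update the row instead of failing.
conn.executemany(
    """
    INSERT INTO my_model (id, field1) VALUES (?, ?)
    ON CONFLICT (id) DO UPDATE SET field1 = excluded.field1
    """,
    incoming,
)

rows = conn.execute("SELECT id, field1 FROM my_model ORDER BY id").fetchall()
```

The existing row is updated in place and the missing row is inserted, without a separate lookup to decide which case applies.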
Finally, as mentioned in the comment above, the slowness might just be a function of your database structure and not anything Django-specific.
Just to add to the accepted answer: one way of recognizing whether the operation is an update or a create is to ask the API owner to include a last-updated timestamp with each row (if possible) and to store it in your db for each row. That way you only have to process the rows whose timestamp differs from the one in the API.
I faced exactly this issue, where I was updating every existing row and creating new ones. It took a whole minute to update 8000-odd rows. With selective updates, I cut my time down to just 10-15 seconds, depending on how many rows have actually changed.
I think the code below can do the same job as update_or_create:
MyModel.objects.filter(...).update()
MyModel.objects.get_or_create()

Occasional IntegrityError on m2m fields using PostgreSQL

I can't detect any pattern, but maybe 1 in every 1000 edits of a certain model returns an IntegrityError on an m2m field. Most of the time this field wasn't even modified. When a model is saved I believe Django always wipes the m2m field and then re-adds the items, right? I saw that Django calls clear() and then add()s the items.
My code then fails with:
IntegrityError: duplicate key value violates unique constraint "app_model_m2m_field_key" DETAIL: Key (model1_id, model2_id)=(597, 1009) already exists.
It seems like the add of items is performed before the items are cleared, which is very weird. I've tried to reproduce it but it's very hard, only happens occasionally. Any idea what could cause it? Could maybe setting auto commit solve this problem?
Thanks in advance
Most likely, you have two requests racing to commit similar changes at the same time.
Request 1 begins a transaction and DELETEs the existing M2M rows.
Request 2 begins a transaction and DELETEs the M2M rows with the same where clause. This blocks waiting for request 1's transaction to commit.
Request 1 re-INSERTs all the M2M rows and commits.
Request 2 resumes, and the delete succeeds without deleting any rows, because all rows that existed when the statement began have already been deleted.
Request 2 tries to re-INSERT an M2M row, but the database detects that it already exists and returns an error.
It's possible to fix this by upgrading to the SERIALIZABLE isolation level (instead of PostgreSQL's default of READ COMMITTED) but at the cost of even more exciting potential failure modes and worse performance.
I'm assuming you're right that Django is performing a DELETE followed by a series of INSERTs, although that wouldn't be a very good plan precisely because it exacerbates this kind of race.
The best plan is to identify what has actually changed and only ask the database to make those changes, because then if you get an integrity error it's because there was a real conflict that you probably couldn't do anything about anyway.
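Identifying what actually changed comes down to two set differences. The sketch below is plain Python; m2m_changes is a hypothetical helper, and the id values are made up, loosely following the error message in the question.

```python
def m2m_changes(current_ids, desired_ids):
    """Return the ids to add and the ids to remove to get from current to desired."""
    current, desired = set(current_ids), set(desired_ids)
    to_add = desired - current      # wanted but not linked yet
    to_remove = current - desired   # linked but no longer wanted
    return to_add, to_remove


# Currently linked: 1009 and 1010. Desired after the edit: 1009 and 1011.
to_add, to_remove = m2m_changes([1009, 1010], [1009, 1011])
```

Here 1009 is left alone, so a concurrent request that also keeps 1009 has nothing to collide with. In Django the two sets could then be applied with the related manager's add(*to_add) and remove(*to_remove) instead of clear() followed by add().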

Looping through data over multiple pages in Django

I'm trying to find the best way to go about my problem and I would love your input. I am trying to allow users to scan multiple barcodes into a text area. After they are submitted they are split into an array. The user then inputs how many iterations of each value in the array are to be inserted into a MySQL database. I've achieved this using PHP and session variables, looping through the array one step at a time. With Django I've found it a little more difficult and I am wondering if I should just have a "temporary" table in my database that gets refilled with the values from the array of barcodes. The following pages then pull each value from the table instead of using any sort of session variables.
Edit:
I apologize for the confusing question. Let me try and clear it up a bit:
I need to render a view based on each value in the user-submitted array. When it is first submitted, a view is rendered for the first value. When the user hits "Next" a view will be rendered for the second value in the array, and so on.
As for the database issue, each value can have two "types." The user will declare how many of each type is added to the database in each of the views I am trying to render.
Thank you.
This has nothing to do with Django.
Forget the temporary table.
Add a "filled" field to your table.
Select the first row that is not filled, and show the "refill" page for that row.
Then write the number the user entered back to the db, and set "filled" to true at the same time.
You probably can port your PHP solution using a Django session object.
I'm not sure if that "one item at a time" is a feature or a "it was easier to code that way" thing, but in the second case - you may want to use Django Formsets to display all items at once and avoid looping through the array.
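A rough sketch of the session approach follows. It uses a plain dict in place of request.session, on the assumption that the session behaves like a dict (Django's does); the function names and the "barcodes" and "position" keys are invented for this example.

```python
def start_scan(session, raw_text):
    """Store the submitted barcodes and start at the first one."""
    session["barcodes"] = raw_text.split()   # split on any whitespace or newlines
    session["position"] = 0


def current_barcode(session):
    """Return the barcode for the page being shown, or None when finished."""
    barcodes = session.get("barcodes", [])
    position = session.get("position", 0)
    if position < len(barcodes):
        return barcodes[position]
    return None


def advance(session):
    """Move to the next barcode, as the "Next" button would, and return it."""
    session["position"] = session.get("position", 0) + 1
    return current_barcode(session)


session = {}   # stand-in for request.session
start_scan(session, "A100\nB200\nC300")
first = current_barcode(session)
second = advance(session)
third = advance(session)
done = advance(session)
```

In a view, start_scan would run when the textarea is submitted, and advance would run on each "Next" after the counts for the current barcode have been saved. Once it returns None, the user can be redirected to a summary page.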

Django ORM misreading PostgreSQL sequences?

Background: I'm running a PostgreSQL database for a Django app (Django 1.1.1, Python 2.4, psycopg2 and Postgres 8.1). I've restored the database from a SQL dump several times. Each time I do that and then try to add a new row, whether from the shell, the admin, or the site front end, I get this error:
IntegrityError: duplicate key violates unique constraint "app_model_pkey"
The data dump is fine and is resetting the sequences. But if I try adding the row again, it's successful! So I can just try jamming a new row into every table and then everything seems to be copacetic.
Question: Given that (1) the SQL dump is good and Postgres is reading it in correctly (per earlier question), and (2) Django's ORM does not seem to be failing systemically getting next values, what is going on in this specific instance?
Django doesn't hold or directly read the sequence values in any way. I've explained this, for example, in this question: 2088210/django-object-creation-and-postgres-sequences.
PostgreSQL increments the sequence when you try to add a row, even if the operation fails (for example with a duplicate key error); the increment is not rolled back. That's why it works the second time you try adding a row.
I don't know why your sequences are not set properly, could you check what is the sequence value before dump and after restore, and do the same with the max() pk of the table? Maybe it's an 8.1 bug with the restore? I don't know. What I'm sure of: it's not Django's fault.
I am guessing that your sequence is out of date.
You can fix that like this:
select setval('app_model_id_seq', max(id)) from app_model;
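If several tables need the same treatment, the statement can be generated. The helper below is a hypothetical sketch that only builds the SQL string. It looks the sequence up with pg_get_serial_sequence instead of hard-coding its name, and it handles an empty table, where max(id) is NULL. It assumes the table and column names come from trusted code, since they are interpolated directly into the SQL.

```python
def sequence_reset_sql(table, pk="id"):
    """Build a PostgreSQL statement that resyncs a serial column's sequence.

    The third setval argument marks the value as "already used" only when the
    table has rows, so an empty table starts again at 1.
    """
    return (
        f"SELECT setval(pg_get_serial_sequence('{table}', '{pk}'), "
        f"COALESCE(MAX({pk}), 1), MAX({pk}) IS NOT NULL) FROM {table};"
    )


sql = sequence_reset_sql("app_model")
```

Django can print equivalent statements for every model in an app with python manage.py sqlsequencereset followed by the app label.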