When I delete an existing record from the database and recreate an exactly identical record through the Django admin interface, the id value shown in the admin is not continuous. For instance, the latest record has id 6 and the previous one has id 5; I delete the one with id 6, and when I recreate it, the new id is 7 instead of 6. I think it is supposed to be 6. Is this an error, and how can I fix it?
That is the correct behaviour. Primary keys should not be re-used, in particular to avoid conflicts when they have already been referenced in other tables.
See this SO question for more info about it: How to restore a continuous sequence of IDs as primary keys in a SQL database?
If you really want to reset the PK's auto-increment counter, you can run ALTER TABLE tablename AUTO_INCREMENT = 1 (for MySQL). Other databases may have additional limitations, such as not being able to reset the counter to any value lower than the highest value already used, even if there are gaps (MySQL with InnoDB, for example).
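If you want to run that statement from Django rather than a database shell, a minimal sketch could look like the following; it assumes MySQL, and myapp_mymodel is a placeholder table name:

# Hypothetical sketch: reset MySQL's AUTO_INCREMENT from Django.
# Replace "myapp_mymodel" with your real table name, and only do this
# when no other rows reference the ids you are about to reuse.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("ALTER TABLE myapp_mymodel AUTO_INCREMENT = 1")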
In a Django project, I'm refreshing tens of thousands of rows of data from an external API on a daily basis. The problem is that since I don't know whether each row is new or just an update, I can't do a bulk_create operation.
Note: some, or perhaps many, of the rows do not actually change on a daily basis, but I don't know which ones, or how many, ahead of time.
So for now I do:
for row in csv_data:
    try:
        MyModel.objects.update_or_create(id=row['id'], defaults={'field1': row['value1']....})
    except:
        print 'error!'
And it takes... forever! One or two rows a second at best, sometimes several seconds per row. Each model I'm refreshing has one or more other models connected to it through a foreign key, so I can't just delete them all and reinsert every day. I can't wrap my head around this one -- how can I significantly cut down the number of database operations so the refresh doesn't take hours and hours?
Thanks for any help.
The problem is that you are doing a database operation for every data row you grabbed from the API. You can avoid that by working out which of the rows are new (and bulk-inserting all of those at once), which of the rows actually need an update, and which didn't change.
To elaborate:
Grab all the relevant rows from the database (meaning all the rows that could possibly be updated):
old_data = MyModel.objects.all()  # if possible, do MyModel.objects.filter(...) instead
Grab all the API data you need to insert or update:
api_data = [...]
For each row of data, work out whether it's new (and put it in an array) or whether it needs to update the DB:
for row in api_data:
    if is_new_row(row, old_data):
        new_rows_array.append(row)
    else:
        if is_data_modified(row, old_data):
            ...
            # do the update
        else:
            continue
MyModel.objects.bulk_create(new_rows_array)
is_new_row - checks whether the row is new; if so, it goes into the array that will be bulk-created
is_data_modified - looks the row up in the old data, checks whether that row's data has changed, and updates it only if it has
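Here is a minimal sketch of that approach, assuming (as in the question's example) that rows are keyed by id and that field1 is the only field being synced; bulk_update() is only available from Django 2.2 onward:

# Sketch only: assumes each API row has an 'id' and a 'value1' key,
# and that MyModel has a matching 'field1' field -- adjust to your schema.
api_data = [...]  # rows fetched from the API

# One query: map existing rows by primary key for fast lookup.
existing = {obj.id: obj for obj in MyModel.objects.all()}

new_rows = []
changed_rows = []
for row in api_data:
    obj = existing.get(row['id'])
    if obj is None:
        new_rows.append(MyModel(id=row['id'], field1=row['value1']))
    elif obj.field1 != row['value1']:
        obj.field1 = row['value1']
        changed_rows.append(obj)
    # unchanged rows are skipped entirely

MyModel.objects.bulk_create(new_rows)
# bulk_update() exists from Django 2.2 onward; on older versions,
# fall back to saving each changed object individually.
MyModel.objects.bulk_update(changed_rows, ['field1'])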
If you look at the source code for update_or_create(), you'll see that it's hitting the database multiple times for each call (either a get() followed by a save(), or a get() followed by a create()). It does things this way to maximize internal consistency - for example, this ensures that your model's save() method is called in either case.
But you might well be able to do better, depending on your specific models and the nature of your data. For example, if you don't have a custom save() method, aren't relying on signals, and know that most of your incoming data maps to existing rows, you could instead try an update() followed by a bulk_create() if the row doesn't exist. Leaving aside related models, that would result in one query in most cases, and two queries at the most. Something like:
updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    MyModel.objects.bulk_create([MyModel(field1="stuff", field2="other")])
(Note that this simplified example has a race condition; see the Django source for how to deal with it.)
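For reference, one common way to close that race, similar in spirit to what get_or_create() does internally, is to catch the IntegrityError from the insert and fall back to an update. This is only a sketch, and it assumes field1 has a unique constraint (otherwise the duplicate insert would not raise at all):

from django.db import IntegrityError, transaction

updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    try:
        # atomic() keeps a failed insert from poisoning the surrounding transaction
        with transaction.atomic():
            MyModel.objects.create(field1="stuff", field2="other")
    except IntegrityError:
        # another process inserted the row in the meantime; update it instead
        MyModel.objects.filter(field1="stuff").update(field2="other")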
In the future there will probably be support for PostgreSQL's UPSERT functionality, but of course that won't help you now.
Finally, as mentioned in the comment above, the slowness might just be a function of your database structure and not anything Django-specific.
Just to add to the accepted answer: one way of recognizing whether the operation is an update or a create is to ask the API owner to include a last-updated timestamp with each row (if possible) and to store it in your DB for each row. That way you only have to check the rows where this timestamp differs from the one in the API.
I faced exactly this issue, where I was updating every existing row and creating new ones. It took a whole minute to update 8,000-odd rows. With selective updates, I cut that down to just 10-15 seconds, depending on how many rows had actually changed.
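A short sketch of that filtering step, assuming a hypothetical last_updated field on MyModel and a matching key in each API row:

# Skip rows whose stored timestamp already matches the API's.
stored = dict(MyModel.objects.values_list('id', 'last_updated'))

rows_to_process = [
    row for row in api_data
    if stored.get(row['id']) != row['last_updated']
]
# rows_to_process then goes through the bulk create/update logic above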
I think the two calls below can together do the same thing as update_or_create:
MyModel.objects.filter(...).update()
MyModel.objects.get_or_create()
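Per row, combining those two calls would look roughly like this (the field names are taken from the question); note it is still one or two queries per row, so the bulk approaches above will scale better:

updated = MyModel.objects.filter(id=row['id']).update(field1=row['value1'])
if not updated:
    MyModel.objects.get_or_create(id=row['id'],
                                  defaults={'field1': row['value1']})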
I have performance data for a model execution instance that I am trying to store in Django (SQLite backend). There are 435 timers with 11 different attributes per timer, resulting in 4,785 unique values that need to be stored per run. I wrote code that generated the models.py code. It gets 1,564 columns in and then throws the error below. There are no duplicate columns of that name, as I have checked the models.py file. When I swap that line with the next one in the models file (and make a new migration), it dies at the same line number but with a new column name.
First time through:
django.db.utils.OperationalError: duplicate column name: CAM_export_processes
Second time:
django.db.utils.OperationalError: duplicate column name: CAM_export_threads
What limitation am I hitting here?
The default setting for SQLITE_MAX_COLUMN is 2000.
This comes from the SQLite documentation. If you are storing over 1000 columns in a table you are most definitely doing something wrong (tm).
While your math is probably sound, there is a chance Django is adding a few extra columns of its own (the automatic id primary key, for instance), pushing you closer to (or over) the limit.
Seriously reconsider your approach for storing this data.
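For example, one normalized alternative (model and field names here are hypothetical) is to store one row per timer/attribute value instead of one column per value:

from django.db import models

class Run(models.Model):
    started_at = models.DateTimeField()

class TimerValue(models.Model):
    # One row per (run, timer, attribute) instead of 4,785 columns per run.
    run = models.ForeignKey(Run, on_delete=models.CASCADE, related_name='timer_values')
    timer = models.CharField(max_length=100)      # e.g. "CAM_export"
    attribute = models.CharField(max_length=50)   # e.g. "processes", "threads"
    value = models.FloatField()

    class Meta:
        unique_together = ('run', 'timer', 'attribute')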
I have code like this:
form = TestForm(request.POST)
form.save(commit=False).save()
This code sometimes works and sometimes doesn't. The problem is with the auto-increment id.
When there is data in the DB that was not written by Django and I want to add data from Django, I get an IntegrityError saying the id already exists.
If I have 2 rows in the DB (not added by Django), I need to click "add data" 3 times. After the third time, when the id has incremented to 3, everything is fine.
How do I solve this?
These integrity errors appear when your table's sequence is not updated after a new item is created, or when the sequence is otherwise out of sync with reality. For example, you import items from some source and those items also contain ids that are higher than what your table's sequence indicates. I have not seen a case where Django itself messes sequences up.
So what I guess happens is that the other source inserting data into your database also inserts ids, and the sequence is not updated. Fix that and your problems should disappear.
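On backends that use sequences (PostgreSQL in particular), Django can generate the SQL needed to resync them. A sketch, where myapp is a placeholder app label (the same thing can be run from the shell as python manage.py sqlsequencereset myapp):

# Prints SQL that resets each table's id sequence to match the existing rows;
# run that SQL against your database (e.g. pipe it into psql).
from django.core.management import call_command

call_command('sqlsequencereset', 'myapp')  # 'myapp' is a placeholder app label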
What I am asking: if I need to extract the 10 newest News records from the DB, the only way I found is
News.objects.all().order_by('-pub_date')[:10]
But is it safe? How does this construction work? Will it fetch all News records from the database, then order them, and then take only 10 of them? Or will it optimize the query and fetch only the 10 newest records from the database? This is significant, since I have over 1,000 News records in the database, and it would take a long time to fetch them all and even longer to sort them.
This is safe, as QuerySets are lazy. At most ten objects will be fetched in your case, because the database query will be optimized to return only ten records from the database.
You can read more about when QuerySets are evaluated and about limiting QuerySets (that section deals with slicing QuerySets, which is what you are doing).
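If you want to see this for yourself, you can print the SQL Django generates for the sliced QuerySet; the LIMIT is applied in the query itself, not in Python:

qs = News.objects.order_by('-pub_date')[:10]
print(qs.query)  # the generated SQL ends with something like "ORDER BY ... DESC LIMIT 10"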