I have a Django application with a model. I need to update the model objects in ascending order of id, something like this:
MyModel.objects.filter(my_field='value').order_by("id").update(another_field='anotheValue')
Can I be sure that the update will be applied in the order given by order_by()?
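Since you already ordered the queryset by id, I would expect it to update sequentially, but update() runs as a single SQL statement, so the order in which rows are touched is ultimately up to the database.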
From the docs,
Using update() also prevents a race condition wherein something might
change in your database in the short period of time between loading
the object and calling save().
So I think the ordering might not matter. However to be perfectly sure, you can iterate over the items and update them:
rows = MyModel.objects.filter(my_field='value').order_by("id")
for row in rows:
    row.another_field = 'another value'
    row.save()
However, the update method would generate one SQL query, whereas the iteration would generate many, so performance might be an issue.
In a Django project, I'm refreshing tens of thousands of lines of data from an external API on a daily basis. The problem is that since I don't know if the data is new or just an update, I can't do a bulk_create operation.
Note: Some, or perhaps many, of the rows do not actually change on a daily basis, but I don't know which, or how many, ahead of time.
So for now I do:
for row in csv_data:
    try:
        MyModel.objects.update_or_create(id=row['id'], defaults={'field1': row['value1']....})
    except Exception:
        print('error!')
And it takes.... forever! One or two lines a second at best, sometimes several seconds per line. Each model I'm refreshing has one or more other models connected to it through a foreign key, so I can't just delete them all and reinsert every day. I can't wrap my head around this one -- how can I significantly cut down the number of database operations so the refresh doesn't take hours and hours?
Thanks for any help.
The problem is that you are performing a database action for every row you fetched from the API. You can avoid that by working out which rows are new (and bulk-inserting all of them), which rows actually need an update, and which didn't change.
To elaborate:
Grab all the relevant rows from the database (meaning all the rows that can possibly be updated)
old_data = MyModel.objects.all() # if possible, narrow this down with MyModel.objects.filter(...)
Grab all the API data you need to insert or update
api_data = [...]
For each row of data, work out whether it is new and put it in an array, or determine whether the row needs to update the DB
new_rows_array = []
for row in api_data:
    if is_new_row(row, old_data):
        new_rows_array.append(row)
    else:
        if is_data_modified(row, old_data):
            ...
            # do the update
        else:
            continue
MyModel.objects.bulk_create(new_rows_array)
is_new_row - determines whether the row is new; new rows are collected in the array that is bulk created at the end
is_data_modified - looks up the row in the old data and determines whether its data has changed, so that an update happens only when it has
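The steps above can be sketched in plain Python as follows. This is only an illustration under assumptions: rows are dicts keyed by "id", `existing` stands in for values fetched once from the database (for instance built from `MyModel.objects.values()`), and `split_rows` is a hypothetical helper name, not something from the answer.

```python
def split_rows(api_rows, existing):
    """Split API rows into new, changed and unchanged ones.

    api_rows: iterable of dicts, each with an "id" key.
    existing: dict mapping id -> dict of the values currently stored.
    """
    to_create, to_update, unchanged = [], [], []
    for row in api_rows:
        current = existing.get(row["id"])
        if current is None:
            to_create.append(row)      # not in the database yet
        elif any(current.get(k) != v for k, v in row.items()):
            to_update.append(row)      # at least one field differs
        else:
            unchanged.append(row)      # nothing to write
    return to_create, to_update, unchanged


# Made-up sample data standing in for the database and the API.
existing = {1: {"id": 1, "field1": "a"}, 2: {"id": 2, "field1": "b"}}
api_rows = [
    {"id": 1, "field1": "a"},   # unchanged
    {"id": 2, "field1": "B"},   # changed
    {"id": 3, "field1": "c"},   # new
]
to_create, to_update, unchanged = split_rows(api_rows, existing)
```

Only `to_create` would then be passed to bulk_create() and only `to_update` written back, so unchanged rows cost no write queries at all.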
If you look at the source code for update_or_create(), you'll see that it's hitting the database multiple times for each call (either a get() followed by a save(), or a get() followed by a create()). It does things this way to maximize internal consistency - for example, this ensures that your model's save() method is called in either case.
But you might well be able to do better, depending on your specific models and the nature of your data. For example, if you don't have a custom save() method, aren't relying on signals, and know that most of your incoming data maps to existing rows, you could instead try an update() followed by a bulk_create() if the row doesn't exist. Leaving aside related models, that would result in one query in most cases, and two queries at the most. Something like:
updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    MyModel.objects.bulk_create([MyModel(field1="stuff", field2="other")])
(Note that this simplified example has a race condition, see the Django source for how to deal with it.)
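One common way to close that window, offered here as an assumption rather than a quote from the Django source, is to rely on a unique constraint: if the insert fails because another writer created the row first, retry the update. The sketch below uses the standard-library sqlite3 module instead of the ORM so that it runs on its own; the `item` table and the `upsert` helper are made up for illustration.

```python
import sqlite3


def upsert(conn, key, value):
    """Update the row for `key`, or insert it if no row was updated."""
    cur = conn.execute("UPDATE item SET value = ? WHERE key = ?", (value, key))
    if cur.rowcount:
        return "updated"
    try:
        conn.execute("INSERT INTO item (key, value) VALUES (?, ?)", (key, value))
        return "created"
    except sqlite3.IntegrityError:
        # Another writer inserted the row between our UPDATE and INSERT,
        # so the row exists now and a second UPDATE will find it.
        conn.execute("UPDATE item SET value = ? WHERE key = ?", (value, key))
        return "updated"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (key TEXT PRIMARY KEY, value TEXT)")
first = upsert(conn, "stuff", "other")      # no row yet, so it is inserted
second = upsert(conn, "stuff", "changed")   # row exists, so one UPDATE suffices
stored = conn.execute("SELECT value FROM item WHERE key = 'stuff'").fetchone()[0]
```

The common case (the row already exists) still costs a single statement; the extra work only happens for new rows or on a collision.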
In the future there will probably be support for PostgreSQL's UPSERT functionality, but of course that won't help you now.
Finally, as mentioned in the comment above, the slowness might just be a function of your database structure and not anything Django-specific.
Just to add to the accepted answer: one way of recognizing whether the operation is an update or a create is to ask the API owner to include a last-updated timestamp with each row (if possible) and to store it in your db for each row. That way you only have to process the rows whose timestamp differs from the one in the API.
I faced this exact issue, where I was updating every existing row and creating new ones. It took a whole minute to update 8000-odd rows. With selective updates, I cut the time down to just 10-15 seconds, depending on how many rows have actually changed.
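A minimal sketch of that timestamp check, assuming each API row carries a "last_updated" value (a hypothetical field name) and that the stored timestamps were loaded into a dict beforehand. The helper name and the sample data are made up.

```python
def rows_needing_refresh(api_rows, stored_timestamps):
    """Return API rows that are new or whose timestamp differs from the stored one.

    stored_timestamps: dict mapping id -> last-updated value saved in the database.
    """
    return [
        row for row in api_rows
        if stored_timestamps.get(row["id"]) != row["last_updated"]
    ]


stored_timestamps = {1: "2024-01-01T00:00:00", 2: "2024-01-01T00:00:00"}
api_rows = [
    {"id": 1, "last_updated": "2024-01-01T00:00:00"},  # same timestamp: skip
    {"id": 2, "last_updated": "2024-01-02T09:30:00"},  # changed upstream
    {"id": 3, "last_updated": "2024-01-02T10:00:00"},  # not stored yet
]
stale = rows_needing_refresh(api_rows, stored_timestamps)
```

Rows missing from `stored_timestamps` come back as well, because `dict.get()` returns None for them, so new rows and changed rows are handled by the same comparison.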
I think the two calls below, used together, can do the same thing instead of update_or_create:
MyModel.objects.filter(...).update()
MyModel.objects.get_or_create()
I have an "index" field that stores a consecutive range of integers; for safety I made it unique.
Now I want to increase this field by one. To keep the values unique, I update them in descending order:
MyModel.objects.all().order_by('-index').update(index=F('index')+1)
What surprises me is that on some machines an IntegrityError is raised, complaining about a duplicate index value.
Is there anything I missed? Can I only save the records one by one?
Thanks in advance!
UPDATE:
I think the root problem is that there is no ORDER BY in an SQL UPDATE command (see UPDATE with ORDER BY, and also SQL Server: UPDATE a table by using ORDER BY)
Obviously Django simply translates my statement into an SQL UPDATE with ORDER BY, which leads to undefined behavior and produces different results on different machines.
You are ordering the queryset ascending. You can make it descending by adding the '-' to the field name in the order by:
MyModel.objects.all().order_by('-index').update(index=F('index')+1)
Django docs for order_by
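Since the row order inside a single UPDATE cannot be relied on, one workaround that does not depend on it is to move the values out of the way first and then bring them back. The sketch below uses sqlite3 so that it runs on its own, and it assumes every index value is non-negative; the `item` table and the `shift_indexes` helper are hypothetical. With the ORM the same idea would presumably be two update() calls built from F() expressions.

```python
import sqlite3


def shift_indexes(conn):
    """Add 1 to every idx without ever producing a duplicate value."""
    with conn:  # commit both statements together, or roll both back
        # Step 1: map each idx to -(idx + 1); the negative results cannot
        # collide with the non-negative values still waiting to be processed.
        conn.execute("UPDATE item SET idx = -(idx + 1)")
        # Step 2: flip the sign back, leaving idx + 1 in every row.
        conn.execute("UPDATE item SET idx = -idx")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, idx INTEGER NOT NULL UNIQUE)")
conn.executemany("INSERT INTO item (idx) VALUES (?)", [(i,) for i in range(1, 6)])
shift_indexes(conn)
result = [row[0] for row in conn.execute("SELECT idx FROM item ORDER BY id")]
```

This costs two statements instead of one, but it is still far cheaper than saving the records one by one.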
I have a Qt QAbstractItemModel, and the underlying information is inside an sqlite database. I want to incrementally add rows from a database query to the model as they are needed. The database fetch may be slow, however. My problem is that beginInsertRows() needs to be called before the model is modified, but it needs to know how many rows will be added. I won't know that until after I do the query. This means I seem to have the following alternatives, all of which are unattractive.
Do two database queries: "SELECT COUNT(*) …" to get the number of result rows, then call beginInsertRows(), and do the real query. The downside, of course, is that a potentially expensive query has to be done twice.
Do my entire query, buffer the results, count the rows, then call beginInsertRows() and insert them into the model. The downside is all the extra buffering.
Call beginInsertRows()/endInsertRows() once for each row in the result set. This is going to cause a whole bunch of unnecessary view updates.
This seems like a general problem to me. Is there a general solution? For instance, is there a way to tell beginInsertRows() one thing and then change your mind?
Thanks.
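One possible middle ground between the second and third alternatives is to fetch the result in fixed-size batches, so that only one batch is buffered at a time and the views are notified once per batch rather than once per row. The following is a Python sketch of that idea, not Qt code: `begin_insert_rows` and `end_insert_rows` are hypothetical callbacks standing in for beginInsertRows()/endInsertRows(), and the batch size is arbitrary.

```python
import sqlite3


def load_in_batches(cursor, rows, begin_insert_rows, end_insert_rows, batch_size=100):
    """Append query results to `rows`, announcing one range per fetched batch."""
    while True:
        batch = cursor.fetchmany(batch_size)   # buffers at most batch_size rows
        if not batch:
            break
        first = len(rows)
        last = first + len(batch) - 1
        begin_insert_rows(first, last)         # the count is known before inserting
        rows.extend(batch)
        end_insert_rows()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(250)])

rows, announced = [], []
load_in_batches(
    conn.execute("SELECT n FROM t ORDER BY n"),
    rows,
    begin_insert_rows=lambda first, last: announced.append((first, last)),
    end_insert_rows=lambda: None,
)
```

The query runs only once, and the number of view updates is the row count divided by the batch size, rounded up.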
In the process of optimizing queries in my app I noticed something strange. In a given section of code I get an object, update some values and then save it. In theory this should execute 2 queries, but in fact it executes 3: 1 SELECT when I get the object and 2 when I save it (another SELECT and then the UPDATE!). Removing one query may seem silly, but in this particular method I am updating many objects, so every query I save is one less hit on the db and should speed up the method.
Inspecting the queries shows that the two SELECT queries are different: the first fetches many columns, while the one executed by save() is simple.
Here is the example code:
myobject = room.myobjects.get(id=myobject_id) # one query executed here
myobject.color = color
myobject.shape = shape
myobject.place = place
myobject.save() # two queries executed here
queries:
1) "SELECT `rooms_object`.`id`, `rooms_object`.`room_id`, ......FROM `rooms_object` WHERE (`rooms_object`.`id` = %s AND `rooms_object`.`room_id` = %s )"
2) "SELECT (1) AS `a` FROM `rooms_object` WHERE `rooms_object`.`id` = %s LIMIT 1"
3) "UPDATE ......this ones obvious"
I want the save method to recognize it already has the object in memory and it does not need to get it again....if that is even possible...
The second query is not actually pulling down the object again. It is doing an extremely fast "existence" check on the id before performing an UPDATE query. All that is returned from that query is a single 1, and the field is indexed, so it should be extremely efficient.
The ORM is designed this way for a reason. First it looks at your object to see whether it currently has an ID. If it does, it runs the SELECT to make sure the row really does still exist in the database. If the row exists, it performs the UPDATE; if somehow the record does not exist, it performs an INSERT. You can test this by creating the object, then deleting the row manually from your database without Django knowing, and then calling save().
This is how it works to make sure Django maintains consistency.
If it were a new object, you would only get a single INSERT query, because it knows the object has no id right now.
This is controlled by the force_update parameter of save():
Model.save([force_insert=False, force_update=False, using=DEFAULT_DB_ALIAS, update_fields=None])
Set force_update to True to disable existence checking ("SELECT (1) AS a FROM...").
https://docs.djangoproject.com/en/dev/ref/models/instances/
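To make the query counts concrete, here is a toy re-creation of the logic described in these answers. It is written against sqlite3 and is not Django's actual implementation; the `save` helper and its return value are made up, and only the `rooms_object` table name comes from the question.

```python
import sqlite3


def save(conn, obj, force_update=False):
    """Mimic the save logic described above; returns the SQL verbs it issued."""
    issued = []
    exists = False
    if obj.get("id") is not None:
        if force_update:
            exists = True                  # trust the caller, skip the check
        else:
            issued.append("SELECT")
            exists = conn.execute(
                "SELECT 1 FROM rooms_object WHERE id = ? LIMIT 1", (obj["id"],)
            ).fetchone() is not None
    if exists:
        issued.append("UPDATE")
        conn.execute("UPDATE rooms_object SET color = ? WHERE id = ?",
                     (obj["color"], obj["id"]))
    else:
        issued.append("INSERT")
        cur = conn.execute("INSERT INTO rooms_object (id, color) VALUES (?, ?)",
                           (obj.get("id"), obj["color"]))
        obj["id"] = cur.lastrowid
    return issued


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rooms_object (id INTEGER PRIMARY KEY, color TEXT)")
obj = {"id": None, "color": "red"}
new_object = save(conn, obj)                    # no id yet
obj["color"] = "blue"
existing_object = save(conn, obj)               # id set, existence is checked
forced = save(conn, obj, force_update=True)     # id set, check skipped
```

In this toy version a new object costs one statement, an existing one costs two, and forcing the update brings that back down to one.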
I have two threads, one which runs something like update t set ColA=foo and the other runs update t set ColB=foo. If they were doing raw SQL statements there would be no contention, but since Django gets and saves the entire row, a race condition can occur.
Is there any way to tell Django that I just want to save a certain column?
An update to this old topic.
We now have the update_fields argument to save():
If save() is passed a list of field names in keyword argument
update_fields, only the fields named in that list will be updated.
https://docs.djangoproject.com/en/stable/ref/models/instances/#specifying-which-fields-to-save
product.name = 'Name changed again'
product.save(update_fields=['name'])
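The race from the question, and the reason writing only the named columns avoids it, can be reproduced without Django. The sketch below uses sqlite3; `save_full_row` and `save_fields` are hypothetical helpers that imitate writing every column versus only the listed ones, and the two "threads" are simulated by interleaving the steps by hand.

```python
import sqlite3


def save_full_row(conn, obj):
    """Write every column, as a save() of the whole row would."""
    conn.execute("UPDATE t SET col_a = ?, col_b = ? WHERE id = ?",
                 (obj["col_a"], obj["col_b"], obj["id"]))


def save_fields(conn, obj, update_fields):
    """Write only the named columns, in the spirit of update_fields."""
    # Column names come from this script itself, never from user input.
    assignments = ", ".join(f"{name} = ?" for name in update_fields)
    values = [obj[name] for name in update_fields]
    conn.execute(f"UPDATE t SET {assignments} WHERE id = ?", values + [obj["id"]])


def run(save_a, save_b):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col_a TEXT, col_b TEXT)")
    conn.execute("INSERT INTO t VALUES (1, 'old', 'old')")
    # Both writers load the row before either of them saves.
    a = {"id": 1, "col_a": "old", "col_b": "old"}
    b = dict(a)
    a["col_a"] = "foo"
    b["col_b"] = "foo"
    save_a(conn, a)
    save_b(conn, b)
    return conn.execute("SELECT col_a, col_b FROM t WHERE id = 1").fetchone()


full = run(save_full_row, save_full_row)
partial = run(lambda c, o: save_fields(c, o, ["col_a"]),
              lambda c, o: save_fields(c, o, ["col_b"]))
```

With full-row writes the second writer puts its stale copy of col_a back, so the first writer's change is lost; with per-column writes both changes survive.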
You are correct that save() will update the entire row, but Django has an update() method which does exactly what you describe.
https://docs.djangoproject.com/en/stable/ref/models/querysets/#update
I think your only option to guarantee this is to write the raw SQL by hand by using Manager.raw() or a cursor depending on which one is more appropriate.