Is there a reusable convenience function to update a ManyToManyField's value without clearing it first? - django

Looking at the Django source, I see that on assignment to a ManyToManyField, all existing values are removed before the new values are added:
https://github.com/django/django/blob/master/django/db/models/fields/related.py#L776
That naive algorithm will introduce tremendous database churn when my application runs, since it updates hundreds of thousands of such relationships at a time.
Since, in my case, most of these updates will be no-ops (i.e., the current and new values will be identical), I could easily reduce churn with an algorithm that first checks which objects need to be added and then checks which need to be removed.
This seems like such a common pattern that I'm wondering if a reusable function to do this already exists?
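For reference, the kind of helper I have in mind would look roughly like this (a minimal sketch; the function name is mine, and it assumes the new values are saved instances of the related model):

def set_m2m(instance, field_name, new_objs):
    """Update a ManyToManyField to match new_objs, touching only the difference."""
    manager = getattr(instance, field_name)
    current = set(manager.all())
    desired = set(new_objs)
    to_add = desired - current
    to_remove = current - desired
    if to_add:
        manager.add(*to_add)
    if to_remove:
        manager.remove(*to_remove)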

Related

Undoing cascade deletions in a dense database

I have a fairly large production database system, based on a large hierarchy of nodes, each with 10+ associated models. If someone deletes a node fairly high in the tree, thousands of models can be deleted, and if that deletion was a mistake, restoring them can be very difficult. I'm looking for a way to give myself an easy 'undo' option.
I've tried using Django-reversion, but it seems like in order to get the functionality I want (easily reverting a large cascade delete) it needs to store a bunch of information with each revision. When I created the initial revisions, the process was less than 10% done and it was already using 8GB in my database, which is not going to work for me.
So, is there a standard solution for this problem? Or a way to customize Django-reversions to fit my use case?
What you're looking for is called a soft delete. Add a column named deleted, defaulting to false, to the table. Now, when you want to do a "delete", just change the row's deleted column to true. Update all the code not to show rows marked as deleted (or move the database table and replace it with a view that doesn't show them). Change all the unique constraints to have a filter WHERE deleted = false so you won't have a problem with being unable to add something similar to a row the user can't even see in the system.
As for the cascades, you have two options: either add an ON UPDATE trigger that updates the child rows, or add the deleted column to the FK and define it as ON UPDATE CASCADE.
You'll get the whole undo functionality at the cost of one extra column (and not being able to delete stuff to save space unless you do it manually).
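A rough Django-flavoured sketch of that idea (the model, manager, and method names here are just illustrative, not a standard API):

from django.db import models

class NotDeletedManager(models.Manager):
    def get_queryset(self):
        # Hide soft-deleted rows from ordinary queries
        return super().get_queryset().filter(deleted=False)

class Node(models.Model):
    parent = models.ForeignKey('self', null=True, blank=True,
                               related_name='children', on_delete=models.CASCADE)
    deleted = models.BooleanField(default=False)

    objects = NotDeletedManager()   # default manager: live rows only
    all_objects = models.Manager()  # escape hatch that sees everything

    def soft_delete(self):
        # "Delete" this node and its descendants by flipping the flag
        self.deleted = True
        self.save(update_fields=['deleted'])
        for child in self.children.all():
            child.soft_delete()

Undoing a cascade is then just walking the same subtree and setting deleted back to False.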

Performance optimization on Django update or create

In a Django project, I'm refreshing tens of thousands of lines of data from an external API on a daily basis. The problem is that since I don't know if the data is new or just an update, I can't do a bulk_create operation.
Note: some, or perhaps many, of the rows do not actually change on a daily basis, but I don't know which, or how many, ahead of time.
So for now I do:
for row in csv_data:
    try:
        MyModel.objects.update_or_create(id=row['id'], defaults={'field1': row['value1'], ...})
    except Exception:
        print('error!')
And it takes... forever! One or two lines a second at best, sometimes several seconds per line. Each model I'm refreshing has one or more other models connected to it through a foreign key, so I can't just delete them all and reinsert every day. I can't wrap my head around this one: how can I significantly cut down the number of database operations so the refresh doesn't take hours and hours?
Thanks for any help.
The problem is that you are doing a database operation for each data row you grabbed from the API. You can avoid that by working out which of the rows are new (and doing a bulk insert of all new rows), which of the rows actually need an update, and which didn't change.
To elaborate:
Grab all the relevant rows from the database (meaning all the rows that could possibly be updated):
old_data = MyModel.objects.all() # if possible than do MyModel.objects.filter(...)
Grab all the API data you need to insert or update:
api_data = [...]
For each row of data, determine whether it's new (and put it in an array) or whether it needs to update the DB:
for row in api_data:
    if is_new_row(row, old_data):
        new_rows_array.append(row)
    else:
        if is_data_modified(row, old_data):
            ...
            # do the update
        else:
            continue

MyModel.objects.bulk_create(new_rows_array)
is_new_row - determines whether the row is new, so it can be added to an array that will be bulk-created
is_data_modified - looks for the row in the old data and determines whether that row's data has changed, and updates only if it has
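For example, a minimal sketch of those two helpers, reusing the field names from the question; here I build an id-keyed dict from old_data once, and pass that instead of the raw queryset, so each check is a dictionary lookup rather than a scan:

# Build a lookup once so each check is a dict hit rather than a queryset scan
old_by_id = {obj.id: obj for obj in old_data}

def is_new_row(row, old_by_id):
    return row['id'] not in old_by_id

def is_data_modified(row, old_by_id):
    existing = old_by_id[row['id']]
    # compare whichever fields the API is allowed to change
    return existing.field1 != row['value1']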
If you look at the source code for update_or_create(), you'll see that it's hitting the database multiple times for each call (either a get() followed by a save(), or a get() followed by a create()). It does things this way to maximize internal consistency - for example, this ensures that your model's save() method is called in either case.
But you might well be able to do better, depending on your specific models and the nature of your data. For example, if you don't have a custom save() method, aren't relying on signals, and know that most of your incoming data maps to existing rows, you could instead try an update() followed by a bulk_create() if the row doesn't exist. Leaving aside related models, that would result in one query in most cases, and two queries at the most. Something like:
updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    MyModel.objects.bulk_create([MyModel(field1="stuff", field2="other")])
(Note that this simplified example has a race condition, see the Django source for how to deal with it.)
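For what it's worth, one possible way to handle that race (a sketch only, and it assumes field1 has a unique constraint) is to catch the IntegrityError a concurrent insert would raise:

from django.db import IntegrityError, transaction

updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    try:
        with transaction.atomic():
            MyModel.objects.bulk_create([MyModel(field1="stuff", field2="other")])
    except IntegrityError:
        # Another process inserted the row between our update() and
        # bulk_create(); fall back to updating the row it created.
        MyModel.objects.filter(field1="stuff").update(field2="other")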
In the future there will probably be support for PostgreSQL's UPSERT functionality, but of course that won't help you now.
Finally, as mentioned in the comment above, the slowness might just be a function of your database structure and not anything Django-specific.
Just to add to the accepted answer: one way of recognizing whether the operation is an update or a create is to ask the API owner to include a last-updated timestamp with each row (if possible) and store it in your DB for each row. That way you only have to check the rows where this timestamp is different from the one in the API.
I faced exactly this issue, where I was updating every existing row and creating new ones. It took a whole minute to update 8,000-odd rows. With selective updates, I cut my time down to just 10-15 seconds, depending on how many rows had actually changed.
I think the two calls below can together do the same thing as update_or_create:
MyModel.objects.filter(...).update()
MyModel.objects.get_or_create()

Selecting a random row in Django, quickly

I have a view which returns data associated with a randomly-chosen row from one of my models. I'm aware of order_by('?') and its performance problems, and I want to avoid using order_by('?') in my view.
Because the data in my model changes very rarely (if at all), I'm considering the approach of caching the entire model in memory between requests. I know how many records I'm dealing with, and I'm comfortable taking the memory hit. If the model does change somehow, I could regenerate the cache at that moment.
Is my strategy reasonable? If so, how do I implement it? If not, how can I quickly select a random row from a model that changes very rarely if at all?
If you know the ids of your objects and their range, you can pick an id at random and then query the database for it.
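For example, a sketch of that approach (it assumes an integer primary key and a non-empty table; MyModel is a stand-in for your model):

import random

from django.db.models import Max

def random_row():
    max_id = MyModel.objects.aggregate(max_id=Max('id'))['max_id']
    # Taking the first row at or above a random id skips any gaps left by
    # deleted rows, at the cost of a slight bias towards rows after gaps.
    return MyModel.objects.filter(id__gte=random.randint(1, max_id)).first()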
A better approach might be to keep the number of objects in your cache, and simply retrieve a random one when you need it:
import random

item_number = random.randint(0, MODEL_COUNT - 1)
MyModel.objects.all()[item_number]
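If you also want to avoid the COUNT query on every request, a rough sketch using Django's cache framework (the cache key is arbitrary, and you'd refresh it whenever the data changes):

import random

from django.core.cache import cache

def random_row():
    count = cache.get('mymodel_count')
    if count is None:
        count = MyModel.objects.count()
        cache.set('mymodel_count', count, timeout=None)  # cache until explicitly invalidated
    return MyModel.objects.all()[random.randint(0, count - 1)]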

Reducing the number of calls to MongoDB with mongoengine

I'm working to optimize a Django application that's (mainly) backed by MongoDB. It's dying under load testing. On the current problematic page, New Relic shows over 700 calls to pymongo.collection:Collection.find. Much of the code was written by junior coders, and normally I would look for places to add indices, make smarter joins and remove loops to reduce query calls, but joins aren't an option here. What I have done (after adding indices based on EXPLAINs) is try to reduce the cost in loops by making a general query and then filtering that smaller set in the loops*. While I've gotten the number down from 900 queries, 700 still seems insane even with the intense amount of work being done on the page. I thought perhaps find was called even when filtering an existing queryset, but the code suggests it's always a database query.
I've added some logging to mongoengine to see where the queries come from and to look at EXPLAIN statements, but I'm not having a ton of luck sifting through the wall of info. mongoengine itself seems to be part of the performance problem: I switched to mongomallard as a test and got a 50% performance improvement on the page. Unfortunately, I got errors on a bunch of other pages (as best I can tell, Mallard doesn't do well when filtering an existing queryset; the error complains about a call to deepcopy that's happening in a generator, which you can't do, so I hit a brick wall there). While Mallard doesn't seem like a workable replacement for us, it does suggest a lot of the processing time is spent converting objects to and from Python in mongoengine.
What can I do to further reduce the calls? Or am I focusing on the wrong thing and should be attacking the problem somewhere else?
EDIT: providing some code/ models
The page in question displays the syllabus for a course, showing all the modules in the course, their lessons and the concepts under the lessons. For each concept, the user's progress in the concept is also shown. So there's a lot of looping to get the hierarchy teased out (and it's not stored according to any of the patterns the Mongo docs suggest).
class CourseVersion(Document):
    ...
    course_instances = ListField(ReferenceField('CourseInstance'))
    courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))

class CoursewareContainer(EmbeddedDocument):
    id = UUIDField(required=True, binary=False, default=uuid.uuid4)
    ....
    courseware_containers = ListField(EmbeddedDocumentField('self'))
    teaching_element_instances = ListField(StringField())
The course's modules, lessons and concepts are stored in courseware_containers; we need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress.
* Just to be clear, I am using a profiler, looking at timings, and doing things The Right Way as best I know how, not simply changing things and hoping for the best.
The code sample isn't bad per se, but there are a number of areas that should be considered and may help improve performance.
class CourseVersion(Document):
    ...
    course_instances = ListField(ReferenceField('CourseInstance'))
    courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))

class CoursewareContainer(EmbeddedDocument):
    id = UUIDField(required=True, binary=False, default=uuid.uuid4)
    ....
    courseware_containers = ListField(EmbeddedDocumentField('self'))
    teaching_element_instances = ListField(StringField())
Review
Unbounded lists.
course_instances, courseware_containers, teaching_element_instances
If these fields are unbounded and continuously grow then the document will move on disk as it grows, causing disk contention on heavily loaded systems. There are two patterns to help minimise this:
a) Turn on power of two sizes (a sketch follows just after this list). This will cost disk space but should lower the amount of I/O churn as the document grows.
b) Initial padding - custom-pad the document on insert so it gets put into a larger extent, then remove the padding. Really an anti-pattern, but it may give you some mileage.
The final barrier is the maximum document size - at 16MB, you can't grow your data beyond that.
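For what it's worth, a hedged sketch of option (a) via pymongo, for the MMAPv1-era storage engine (the database and collection names are placeholders; from MongoDB 2.6 onwards this allocation strategy is the default):

from pymongo import MongoClient

db = MongoClient().my_database
# Pad record allocations to powers of two so documents can grow in place
# more often, reducing on-disk moves and I/O churn.
db.command("collMod", "course_version", usePowerOf2Sizes=True)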
Lists of ReferenceFields - course_instances
MongoDB doesn't have joins, so it costs an extra query to look up a ReferenceField - essentially they are an in-app join. Which isn't bad per se, but it's important to understand the tradeoff. By default mongoengine won't automatically dereference the field; only when you access course_version.course_instances will it do another query and populate the whole list of references. So it can cost you another query - if you don't need the data, then exclude() it from the query to stop any leaking queries.
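For instance, something along these lines (version_id is a placeholder):

# Drop the reference list from the returned document so touching it
# can't trigger a dereferencing query on this page.
course_version = CourseVersion.objects.exclude('course_instances').get(id=version_id)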
EmbeddedFields
These fields are part of the document, so there is no cost for them, other than the wire costs of transmitting and loading the data. As they are part of the document, you don't need select_related to get this data.
teaching_element_instances
Are these a list of ids? It says it's a StringField in the code sample above. Either way, if you don't need to dereference the whole list, then storing the _ids as StringFields and manually dereferencing may be more efficient if coded correctly - especially if you just need the latest (last?) id.
Model complexity
The CoursewareContainer is complex. For any given CourseVersion you have n CoursewareContainers, which themselves each have a list of n containers, and those each have n containers, and so on...
Finding the most recent instances
We need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress.
I'm unsure whether there is a single instance you are after, or one per container, or one per course. Either way, the logic for querying the data should be examined. If it's a single instance you are after, then that could be stored against the user to simplify the logic of looking it up. If it's per course or container, then to improve performance ensure you minimise the number of queries - if possible, collect all the ids and then issue a single $in query at the end, rather than doing a query per container.
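As a sketch of that last suggestion (the TeachingElementInstance model and its key field are my guesses, since that model isn't shown):

def collect_te_ids(container, ids):
    # Walk the container tree once, collecting ids instead of querying per container
    ids.extend(container.teaching_element_instances)
    for child in container.courseware_containers:
        collect_te_ids(child, ids)

all_ids = []
for container in course_version.courseware_containers:
    collect_te_ids(container, all_ids)

# One $in query for everything, rather than one query per container
instances = TeachingElementInstance.objects(id__in=all_ids)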
Mongoengine costs
Currently, there is a performance cost to loading the data into Mongoengine classes - if you don't need the classes and are happy to work with simple dictionaries then either issue a raw pymongo query or use as_pymongo.
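For example, something like this returns plain dicts rather than Document instances:

# Plain dicts straight from pymongo, skipping Document construction
raw_versions = CourseVersion.objects.only('courseware_containers').as_pymongo()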
Schema design
The schema looks logical enough, but is it suitable for the use case - in essence, is it playing to MongoDB's strengths, or is it putting a relational peg in a document-database-shaped hole? I can't answer that for you, but I do know the way to the happy path with MongoDB is to design the schema around its use case. With relational databases, schema design from the outset is simple - you normalise; with document databases, how the data is used is a primary factor.
MongoDB best practices
There are many other best practices, and MongoDB has a guide which might be of interest: MongoDB Operations Best Practices.
Feel free to contact me via the Mongoengine mailing list to discuss further and if needs be discuss in private.
Ross

Supporting multi-add/delete (and undo/redo) with a QAbstractItemModel (C++)

Greetings,
I've been writing some nasty code to support the undo/redo of deletion of an arbitrary set of objects from my model. I feel like I'm going about this correctly, as all the other mutators (adding/copy-pasting) are subsets of this functionality.
The code is nastier than it needs to be, mostly because the only way to mutate the model involves calling beginInsertRows/beginRemoveRows and removing the rows in a range (just doing 1 row at a time; no need to optimize "neighbors" into a single call yet).
The problem with beginInsertRows/beginRemoveRows is that removal of a row could affect another QModelIndex (say, one cached in a list). For instance:
ParentObj
->ChildObj1
->ChildObj2
->ChildObj3
Say I select ChildObj1 and ChildObj3 and delete them: if I remove ChildObj1 first, I've changed ChildObj3's QModelIndex (its row is now different). Similar issues occur if I delete a parent object (but I've fixed that by "pruning" children from the list of objects).
Here are the ways I've thought of working around this interface limitation, but I thought I'd ask for a better one before forging ahead:
Move "backwards", assuming a provided list of QModelIndices is orderered from top to bottom just go from bottom up. This really requires sorting to be reliable, and the sort would probably be something naive and slow (maybe there's a smart way of sorting a collection of QModelIndexes? Or does QItemSelectionModel provide good (ordered) lists?)
Update other QModelIndeces each time an object is removed/added (can't think of a non-naive solution, search the list, get new QModelIndeces where needed)
Since updating the actual data is easy, just update the data and rebuild the model. This seems grotesque, and I can imagine it getting quite slow with large sets of data.
Those are the ideas I've got currently. I'm working on option 1 right now.
Regards,
Dan O
Think of beginRemoveRows/endRemoveRows, etc. as methods to ask the QAbstractItemModel base class to fix up your persistent model indexes for you instead of just a way of updating views, and try not to confuse the QAbstractItemModel base class in its work on those indexes. Check out http://labs.trolltech.com/page/Projects/Itemview/Modeltest to exercise your model and see if you are keeping the QAbstractItemModel base class happy.
Where QPersistentModelIndex does not help is if you want to keep your undo/redo data outside of the model. I built a model that is heavily edited, and I did not want to try keeping everything in the model. I store the undo/redo data on the undo stack. The problem is that if you edit a column, storing the persistent index of that column on the undo stack, and then delete the row holding that column, the column's persistent index becomes invalid.
What I do is keep both a persistent model index, and a "historical" regular QModelIndex. When it's time to undo/redo, I check if the persistent index has become invalid. If it has, I pass the historical QModelIndex to a special method of my model to ask it to recreate the index, based on the row, column, and internalPointer. Since all of my edits are on the undo stack, by the time I've backed up to that column edit on the undo stack, the row is sure to be there in the model. I keep enough state in the internalPointer to recreate the original index.
I would consider using a "all-data" model and a filter proxy model with the data model. Data would only be added to the all-data model, and never removed. That way, you could store your undo/redo information with references to that model. (May I suggest QPersistentModelIndex?). Your data model could also track what should be shown, somehow. The filter model would then only return information for the items that should be shown at a given time.