Model / View: Filtering data beforehand in database or at runtime in proxy model? - c++

Imagine an applications that displays data from a sqlite database.
The app is making use of model/view programming.
It can have multiple views acting in parallel on different subsets of the same data (subsets made by filtering the required data types).
(Sidenote: I am using Qt, so there is no controller part, of course, but I did not find a more suitable tag.)
I am not sure which approach to take:
1a. Load all database data into one single model
1b. Then apply the model to all views, filtering the data inside the view with a proxy model
2a. One model for each view, but filtering done inside sqlite database.
Pros/Cons:
Idea 1:
(+) one model, makes use of model/view advantages (e.g. updating all connected views)
(-) memory usage could get huge because all data is loaded into a model, but only a subset is shown
Idea 2:
(+) theoeretically lower memory usage because only the filtered data is loaded from the database
(-) the views can have filters that could lead to intersecting data, meaning the same data would be stored in more than one model -> perhaps practically even bigger memory usage than in Idea 1
The data being loaded here is just case metadata, e.g. title, description, datetime and so on. Bigger data like images, files are not being loaded here. So as the database could indeed grow big (big for this kind of application, say 200 gb for power users), this does not affect the topic of the present question, because the metadata is much, much smaller and is proportional to overall data count, not data size.
Do you have practical experience with such a configuration and can suggest which one to use? It seems to me that Idea 1 is the way to go, but I am not sure about it.

In my experience, the less data is loaded from the database into memory, the better. It is not just the memory usage, but also startup time. If the data is delivered over the network, loading a few gigabytes can take forever.
So I would go for a variant of your second solution, where each table view has its own model. The model is an implementation of QAbstractItemModel that lazily fetches only the rows that currently need to be displayed. The models could, however, share a common cache. This will also make sure that they all display the same data where it intersects.

Related

Alternatives to dynamically creating model fields

I'm trying to build a web application where users can upload a file (specifically the MDF file format) and view the data in forms of various charts. The files can contain any number of time based signals (various numeric data types) and users may name the signals wildly.
My thought on saving the data involves 2 steps:
Maintain a master table as an index, to save such meta information as file names, who uploaded it, when, etc. Records (rows) are added each time a new file is uploaded.
Create a new table (I'll refer to this as data tables) for each file uploaded, within the table each column will be one signal (first column being timestamps).
This brings the problem that I can't pre-define the Model for the data tables because the number, name, and datatype of the fields will differ among virtually all uploaded files.
I'm aware of some libs that help to build runtime dynamic models but they're all dated and questions about them on SO basically get zero answers. So despite the effort to make it work, I'm not even sure my approach is the optimal way to do what I want to do.
I also came across this Postgres specifc model field which can take nested arrays (which I believe fits the 2-D time based signals lists). In theory I could parse the raw uploaded file and construct such an array and basically save all the data in one field. Not knowing the limit of size of data, this could also be a nightmare for the queries later on, since to create the charts it usually takes only a few columns of signals at a time, compared to a total of up to hundreds of signals.
So my question is:
Is there a better way to organize the storage of data? And how?
Any insight is greatly appreciated!
If the name, number and datatypes of the fields will differ for each user, then you do not need an ORM. What you need is a query builder or SQL string composition like Psycopg. You will be programatically creating a table for each combination of user and uploaded file (if they are different) and programtically inserting the records.
Using postgresql might be a good choice, you might also create a GIN index on the arrays to speed up queries.
However, if you are primarily working with time-series data, then using a time-series database like InfluxDB, Prometheus makes more sense.

Selecting a random row in Django, quickly

I have a view which returns data associated with a randomly-chosen row from one of my models. I'm aware of order_by('?') and its performance problems, and I want to avoid using order_by('?') in my view.
Because the data in my model changes very rarely (if at all), I'm considering the approach of caching the entire model in memory between requests. I know how many records I'm dealing with, and I'm comfortable taking the memory hit. If the model does change somehow, I could regenerate the cache at that moment.
Is my strategy reasonable? If so, how do I implement it? If not, how can I quickly select a random row from a model that changes very rarely if at all?
If you know the ids of your object, and its range you can randomize over the ids, and then query the database
A better approach might be to keep the number of objects in your cache, and simply retrieve a random one when you need it:
item_number = random.randint(MODEL_COUNT)
MyModel.objects.all()[item_number]

Ember index data -vs- show data

How do people deal with index data (the data usually shown on index pages, like a customer list) -vs- the model detail data?
When somebody goes to the customer/index route -- they only need access to a small subset of the full customer resource. Since I am dealing with legacy data, my customer model has > 10 relationships. It seems wasteful to have the api return a complete and full customer representation for every customer just to render a list/select/index view.
I know those relationships are somewhat lazy-loaded, but it still takes effort on the backend to pull all those relationships in. For some relationships (such as customer->invoices) this could be a large list of ids.
I feel answers to this can be very opinionated. But my two cents:
The API you are drawing on for your data should have an end-point to fetch the subset of data you're interested in, e.g. /api/mini-customer vs /api/customer.
You can then either define two separate models (one to represent the model in the list and one to represent the detailed view), or simply populate the original model with the subset of data and merge the extra data in at a later point.
That said, I've also seen plenty of cases such as the one you describe, where you load all data initially and just display the subset to begin with. If it's reasonable that the data will eventually be used and your page-load constraints can handle it, then this can be an acceptable approach.

Reducing the number of calls to MongoDB with mongoengine

I'm working to optimize a Django application that's (mainly) backed by MongoDB. It's dying under load testing. On the current problematic page, New Relic shows over 700 calls to pymongo.collection:Collection.find. Much of the code was written by junior coders and normally I would look for places to add indicies, make smarter joins and remove loops to reduce query calls, but joins aren't an option here. What I have done (after adding indicies based on EXPLAINs) is tried to reduce the cost in loops by making a general query and then filtering that smaller set in the loops*. While I've gotten the number down from 900 queries, 700 still seems insane even with the intense amount of work being done on the page. I thought perhaps find was called even when filtering an existing queryset, but the code suggests it's always a database query.
I've added some logging to mongoengine to see where the queries come from and to look at EXPLAIN statements, but I'm not having a ton of luck sifting through the wall of info. mongoengine itself seems to be part of the performance problem: I switched to mongomallard as a test and got a 50% performance improvement on the page. Unfortunately, I got errors on a bunch of other pages (as best I can tell it appears Mallard doesn't do well when filtering an existing queryset; the error complains about a call to deepcopy that's happening in a generator, which you can't do-- I hit a brick wall there). While Mallard doesn't seem like a workable replacement for us, it does suggest a lot of the proessing time is spent converting objects to and from Python in mongoengine.
What can I do to further reduce the calls? Or am I focusing on the wrong thing and should be attacking the problem somewhere else?
EDIT: providing some code/ models
The page in question displays the syllabus for a course, showing all the modules in the course, their lessons and the concepts under the lessons. For each concept, the user's progress in the concept is also shown. So there's a lot of looping to get the hierarchy teased out (and it's not stored according to any of the patterns the Mongo docs suggest).
class CourseVersion(Document):
...
course_instances = ListField(ReferenceField('CourseInstance'))
courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))
class CoursewareContainer(EmbeddedDocument):
id = UUIDField(required=True, binary=False, default=uuid.uuid4)
....
courseware_containers = ListField(EmbeddedDocumentField('self'))
teaching_element_instances = ListField(StringField())
The course's modules, lessons and concepts are stored in courseware_containers; we need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress.
* Just to be clear, I am using a profiler and looking at times and doings things The Right Way as best I know, not simply changing things and hoping for the best.
The code sample isn't bad per-sae but there are a number of areas that should be considered and may help improve performance.
class CourseVersion(Document):
...
course_instances = ListField(ReferenceField('CourseInstance'))
courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))
class CoursewareContainer(EmbeddedDocument):
id = UUIDField(required=True, binary=False, default=uuid.uuid4)
....
courseware_containers = ListField(EmbeddedDocumentField('self'))
teaching_element_instances = ListField(StringField())
Review
Unbounded lists.
course_instances, courseware_containers, teaching_element_instances
If these fields are unbounded and continuously grow then the document will move on disk as it grows, causing disk contention on heavily loaded systems. There are two patterns to help minimise this:
a) Turn on Power of two sizes. This will cost disk space but should lower the amount of io churn as the document grows
b) Initial Padding - custom pad the document on insert so it gets put into a larger extent and then remove the padding. Really an anti pattern but it may give you some mileage.
The final barrier is the maximum document size - 16MB you can't grow your data bigger than that.
Lists of ReferenceFields - course_instances
MongoDB doesn't have joins so it costs an extra query to look up a ReferenceField - essentially they are an in app join. Which isn't bad per-sae but its important to understand the tradeoff. By default mongoengine won't automatically dereference the field only doing course_version.course_instances will it do another query and then populate the whole list of references. So it can cost you another query - if you don't need the data then exclude() it from the query to stop any leaking queries.
EmbeddedFields
These fields are part of the document, so there is no cost for them, other than the wire costs of transmitting and loading the data. **As they are part of the document, you don't need select_related to get this data.
teaching_element_instances
Are these a list of id's? It says its a StringField in the code sample above. Either way, if you don't need to dereference the whole list then storing the _ids as a StringField and manually dereferencing may be more efficient if coded correctly - especially if you just need the latest (last?) id.
Model complexity
The CoursewareContainer is complex. For any given CourseVersion you have n CoursewareContainers with themselves have a list of n containers and those each have n containers and on...
Finding the most recent instances
We need to get all of the concepts so we can get the list of ids in
teaching_element_instances to find the most recent one the user has
worked on (if any) for that concept and then look up their progress.
I'm unsure if there is a single instance you are after or one per Container or one per Course. Either way - the logic for querying the data should be examined. If its a single instance you are after - then that could be stored against the user so to simplify the logic of looking this up. If its per course or container then to improve performance ensure you minimise the number of queries - if possible collect all the ids and then at the end issue a single $in query, rather than doing a query per container.
Mongoengine costs
Currently, there is a performance cost to loading the data into Mongoengine classes - if you don't need the classes and are happy to work with simple dictionaries then either issue a raw pymongo query or use as_pymongo.
Schema design
The schema looks logical enough but is it suitable for the use case - in essence is it using MongoDB's strengths or is it putting a relational peg in a document database shaped hole? I can't answer than for you but I do know the way to the happy path with MongoDB is design the schema based on its use case. With relational databases schema design from the outset is simple - you normalise, with document databases how the data is used is a primary factor.
MongoDB best practices
There are many other best practices and mongodb have a guide which might be of interest: MongoDB Operations Best Practices.
Feel free to contact me via the Mongoengine mailing list to discuss further and if needs be discuss in private.
Ross

Design pattern for caching dynamic user content (in django)

On my website I'm going to provide points for some activities, similarly to stackoverflow. I would like to calculate value basing on many factors so each computation for each user will take for instance 10 SQL queries.
I was thinking about caching it:
in memcache,
in user's row in database (so that wherever I need to get user from base I easly show the points)
Storing in database seems easy but on other hand it's redundant information and I decided to ask, since maybe there is easier and prettier solution which I missed.
I'd highly recommend this app for storing the calculated values in the model: https://github.com/initcrash/django-denorm
Memcache is faster than the db... but if you already have to retrieve the record from the db anyway, having the calculated values cached in the rows you're retrieving (as a 'denormalised' field) is even faster, plus it's persistent.