Negative impact of a Django model with multiple fields (75+ fields) [duplicate] - django

This question already has answers here:
Why use a 1-to-1 relationship in database design?
(6 answers)
Closed 6 months ago.
I'm in the process of building a web app that takes user input and stores it for retrieval and data manipulation. There are essentially 100-200 static fields that the user needs to input to create the Company model.
I see how I could break the Company model into multiple 1-to-1 Django models that map back to a Company, such as:
Company General
Company Notes
Company Financials
Company Scores
But why would I not create a single Company model with 200 fields?
Are there noticeable performance tradeoffs when trying to load a Query Set?

In my opinion, it would be wise to split this into multiple related models. That gives your codebase better room to scale and makes it easier to navigate your model fields. It also means that when you write a custom serializer or custom views that deal with only some of your fields, you don't have to retrieve 100+ fields every time.
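A minimal sketch of that split; the model and field names (CompanyGeneral, CompanyFinancials, founded, annual_revenue) are illustrative, not from the question:

from django.db import models

class Company(models.Model):
    name = models.CharField(max_length=255)

class CompanyGeneral(models.Model):
    # one row per company; the "general" fields live here
    company = models.OneToOneField(Company, on_delete=models.CASCADE, related_name="general")
    founded = models.DateField(null=True, blank=True)
    # ... remaining general fields

class CompanyFinancials(models.Model):
    # financial fields split into their own table
    company = models.OneToOneField(Company, on_delete=models.CASCADE, related_name="financials")
    annual_revenue = models.DecimalField(max_digits=16, decimal_places=2, null=True)
    # ... remaining financial fields

A view or serializer that only needs financial data can then query CompanyFinancials directly, and Company.objects.select_related("financials") fetches both sides in a single JOIN when you do need them together.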

Turns out I wasn't asking the right question. This is the question I was really asking, and I believe it's more a database question than a Django question: Why use a 1-to-1 relationship in database design?
From the logical standpoint, a 1:1 relationship should always be merged into a single table.
On the other hand, there may be physical considerations for such "vertical partitioning" or "row splitting", especially if you know you'll access some columns more frequently or in a different pattern than the others, for example:
You might want to cluster or partition the two "endpoint" tables of a 1:1 relationship differently.
If your DBMS allows it, you might want to put them on different physical disks (e.g. the more performance-critical one on an SSD and the other on a cheap HDD).
You have measured the effect on caching and you want to make sure the "hot" columns are kept in cache, without "cold" columns "polluting" it.
You need a concurrency behavior (such as locking) that is "narrower" than the whole row. This is highly DBMS-specific.
You need different security on different columns, but your DBMS does not support column-level permissions.
Triggers are typically table-specific. While you can theoretically have just one table and have the trigger ignore the "wrong half" of the row, some databases may impose additional limits on what a trigger can and cannot do. For example, Oracle doesn't let you modify the so-called "mutating" table from a row-level trigger - by having separate tables, only one of them may be mutating, so you can still modify the other from your trigger (but there are other ways to work around that).
Databases are very good at manipulating the data, so I wouldn't split the table just for the update performance, unless you have performed actual benchmarks on representative amounts of data and concluded the performance difference is actually there and significant enough (e.g. to offset the increased need for JOINing).
On the other hand, if you are talking about "1:0 or 1" (and not a true 1:1), this is a different question entirely, deserving a different answer...
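Translating the "hot vs cold columns" point into Django terms: a single wide model can still limit which columns are fetched per query, while the 1-to-1 split makes the JOIN explicit and opt-in. A minimal sketch, reusing the hypothetical models from the first answer; the field names are illustrative:

# One wide model: only()/defer() restrict which columns are selected.
hot_rows = Company.objects.only("id", "name")
cold_deferred = Company.objects.defer("notes", "scores")   # hypothetical field names

# Split models: the extra JOIN happens only when you ask for it.
with_financials = Company.objects.select_related("financials")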

Related

Safely segregating customer data in Spanner

We're exploring options for reliably segregating customer data in Spanner. The most obvious solution is a customer per database, but the 100 database/instance limitation renders that impractical. Past experience leads me to be very suspicious of any plan to add a customer-id field to the primary key of each table, because it's far too easy to screw that up in SQL queries, leading to dangerous data cross-talk.
I'm considering weird solutions like using all 2k tables/instance, and taking the ~32 tables we need per customer and prefixing those. E.g., [cust-id]-Table1, [cust-id]-Table2, etc. At least then the customer segregation logic that needs to be iron-clad can be put in one place that's hard to screw up in queries. But is anyone aware of a less weird approach? E.g., "100" is a suspiciously-non-round number in a technical limitation -- is that adjustable somehow?
Unfortunately, 100 databases/instance is not an adjustable value.
Though I don't fully understand "very suspicious of any plan to add a customer-id field to the primary key of each table, because it's far too easy to screw that up in SQL queries, leading to dangerous data cross-talk." Are you concerned about query performance, data correctness, code correctness, or schema?
With this schema, ~32 tables per customer will only allow you to store ~6000 customers. Though I would suggest benchmarking with other schema choices Spanner exposes.
Would you be able to provide a high-level schema of these customer tables as well as your query patterns?
Also, I'd suggest reading into the following for more ideas that may fit your use case better:
Spanner Schema
Interleaved Tables
Secondary Indexes
SQL Best Practices
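Of those, interleaved tables are the closest fit to the segregation concern: the customer id becomes part of every child table's primary key by construction, rather than a column each query has to remember to filter on. A minimal sketch using the Python client; the instance, database, and table names are illustrative only:

from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")     # hypothetical instance id
database = instance.database("shared-db")     # hypothetical database id

operation = database.update_ddl([
    """CREATE TABLE Customers (
           CustomerId STRING(36) NOT NULL,
           Name       STRING(MAX)
       ) PRIMARY KEY (CustomerId)""",
    """CREATE TABLE Orders (
           CustomerId STRING(36) NOT NULL,
           OrderId    STRING(36) NOT NULL,
           Total      FLOAT64
       ) PRIMARY KEY (CustomerId, OrderId),
       INTERLEAVE IN PARENT Customers ON DELETE CASCADE""",
])
operation.result()  # wait for the schema change to complete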

AWS DynamoDB Table Design: Store two UserIDs and Details in Table

I'm building an app where two users can connect with each other and I need to store that connection (e.g. a friendship) in a DynamoDB table. Basically, the connection table has two fields:
userIdA (hash key)
userIdB (sort key)
I was thinking of adding an index on userIdB to query on both fields. Should I store a connection with one record (ALICE, BOB) or two records (ALICE, BOB; BOB, ALICE)? The first option needs one write operation and less space, but I have to query twice to get all connections of a user. The second option needs two write operations and more space, but I only have to query once for the userId.
The user table has details like name and email:
userId (hash key)
name (sort key)
email
In my app, I want to show all connections of a certain user with user details in a listview. That means I have two options:
Store the user details of the connected users also in the connection table, e.g. add two name fields to that table. This is fast, but if the user name changes (name and email are retrieved from Facebook), the details are invalid and I need to update all entries.
Query the user details of each userId with a Batch Get request to read multiple items. This may be slower, but I always have up to date user details and don't need to store them in the connection table.
So what is the better solution, or are there any other advantages/disadvantages that I may have overlooked?
EDIT
After some google research regarding friendship tables with NoSQL databases, I found the following two links:
How does Facebook maintain a list of friends for each user? Does it maintain a separate table for each user?
NoSQL Design Patterns for Relational Data
The first link suggests storing the connection (or friendship) in both directions with two records, because it makes querying easier and faster:
Connections:
1 userIdA userIdB
2 userIdB userIdA
The second link suggests saving a subset of duplicated data ("summary") into the tables so it can be read faster with just one query. That would mean saving the user details into the connection table as well, and saving the userIds into an attribute of the user table:
Connections:
# userIdA userIdB userDetails status
1 123 456 { userId: 456, name: "Bob" } connected
2 456 123 { userId: 123, name: "Alice" } connected
Users:
# userId name connections
1 123 Alice { 456 }
2 456 Bob { 123 }
This database model makes it pretty easy to query connections, but seems difficult to update if user details change. Also, I'm not sure if I need the userIds within the user table again, because I can easily query on a userId.
What do you think about that database model?
In general, NoSQL databases are often combined with a couple of assumptions:
Eventual consistency is acceptable. That is, it's often acceptable in application design if, during an update, some of the intermediate answers aren't right. For example, it might be fine if, for a few seconds while Alice is becoming Bob's friend, "Is Alice Bob's friend?" returns true while "Is Bob Alice's friend?" returns false.
Performance is important. If you're using nosql it's generally because performance matters to you. It's also almost certainly because you care about the performance of operations that happen most commonly. (It's possible that you have a problem where the performance of some uncommon operation is so bad that you can't do it; nosql is not generally the answer in that situation)
You're willing to make uncommon operations slower to improve the performance of common operations.
So, how does that apply to your question? First, it suggests that ultimately the answer depends on performance. That is, no matter what people say here, the right answer depends on what you observe in practice. You can try multiple options and see what results you get.
With regard to the specific options you enumerated:
Assuming that performance is enough of a concern that nosql is a reasonable solution for your application, it's almost certainly query rather than update performance you care about. You probably will be happy if you make updates slower and more expensive so that queries can be faster. That's kind of the whole point.
You can likely handle updates out of band -- that is, eventual consistency likely works for you. You could submit update operations to an SQS queue rather than handling them during your page load. So if someone clicks a confirm-friend button, you could queue a request to actually update your database. It is OK even if that involves rebuilding their user row, rebuilding the friend rows, and even updating some counts about how many friends they have.
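A minimal sketch of that out-of-band pattern with boto3; the queue URL and message shape are assumptions, not part of the question:

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/friend-updates"  # hypothetical

# On "confirm friend", enqueue the work instead of doing it in the request;
# a background consumer rebuilds the affected connection/user rows later.
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({"action": "confirm_friend", "userIdA": "123", "userIdB": "456"}),
)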
It probably does make sense to store a friend row in each direction so you only need one query.
It probably does make sense to store the user information like Name and picture that you typically display in a friend list duplicated in the friendship rows. Note that whenever the name or picture changes you'll need to go update all those rows.
It's less clear that storing the friends in the user table makes sense. That could get big. Also, it could be tricky to guarantee eventual consistency. Consider what happens if you are processing updates to two users' friendships at the same time. It's very important that you not end up with inconsistency once all the dust has settled.
Whenever you have non-normalized data such as duplicating rows in each direction, or copying user info into friendship tables, you want some way to revalidate and fix your data. You want to write code that in the background can go scan your system for inconsistencies caused by bugs or crashed activities and fix them.
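Putting the two recommendations together (a row per direction, with the duplicated "summary" details embedded), a minimal boto3 sketch; the table name follows the question, the helper names are assumptions:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
connections = dynamodb.Table("Connections")  # hypothetical table name

def add_friendship(user_a, user_b, name_a, name_b):
    # Write the row in both directions so one query per user is enough;
    # the duplicated name is the "summary" data described above.
    with connections.batch_writer() as batch:
        batch.put_item(Item={"userIdA": user_a, "userIdB": user_b,
                             "userDetails": {"userId": user_b, "name": name_b},
                             "status": "connected"})
        batch.put_item(Item={"userIdA": user_b, "userIdB": user_a,
                             "userDetails": {"userId": user_a, "name": name_a},
                             "status": "connected"})

def friends_of(user_id):
    # A single query on the hash key returns every connection for that user.
    return connections.query(KeyConditionExpression=Key("userIdA").eq(user_id))["Items"]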
I suggest you have the following fields in the table:
userId (hash key)
name (sort key)
email
connections (comma-separated, or an array of userIds, assuming a user can have multiple connections)
This structure can ensure consistency across your data.

For a shopping app, what are the pros/cons to building ONE model or TWO models?

Additional context:
A user can buy one or more items every time they shop. I'm trying to figure out the pros/cons of two approaches. I've written out what I think are the pros of each (no need to call out cons, since a con of one can be written as a pro of the other), but I want to get feedback from the community.
Approach 1:
Build a single model, e.g., Items, where there is a record for every item in the transaction.
Pros:
Generally simpler, one model is always nice
Aligns well with the fact that items are priced and cancelled/refunded individually (i.e., there's not really any discount or fee occurring at the Purchase level that would either 1) not be allocated to individual items or 2) not merit its own model)
Approach 2:
Build two models, e.g., Purchases and Items, where Purchases is a parent record that represents the transaction, and Items are the child records that represent every item bought in that transaction.
Pros:
For the business, I think it's easier in two ways: 1) it's easier to run analytics to figure out for example how many items people want to buy each time they make a purchase transaction (this isn't impossible with Approach 1, but certainly easier with Approach 2), and perhaps most importantly: 2) from a fulfillment perspective, it seems easier to send the fulfillment center one Purchase with many items since the delivery dates will all be the same, rather than a bunch of Items that they then have to aggregate (again it's not impossible with Approach 1, but much easier with Approach 2)
This can get quite complicated, and in the past I've used far more advanced versions of #2. You want to normalise your data as much as possible (look up database normalisation for further info) to make it easier to run reports, but also to maintain consistency of data and reduce duplication. In some real-world scenarios it's not always possible to fully normalise, and processing performance considerations also play a part sometimes - if you fully normalise data, you split it into many small chunks (e.g. rows in tables), but to reconstruct your data you then have to retrieve it from many locations (e.g. multiple database queries), which has a performance hit.
Go with #2, and thoroughly plan how you are going to structure your data before you get too far into coding it. For a well-structured model, it should be reasonably straightforward to expand on the system in future. A flat structure can become a nightmare to maintain.
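A minimal Django sketch of #2, with illustrative field names:

from django.conf import settings
from django.db import models

class Purchase(models.Model):
    # parent record: one row per checkout/transaction
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    created_at = models.DateTimeField(auto_now_add=True)

class Item(models.Model):
    # child record: one row per item bought in the purchase
    purchase = models.ForeignKey(Purchase, on_delete=models.CASCADE, related_name="items")
    name = models.CharField(max_length=255)
    price = models.DecimalField(max_digits=10, decimal_places=2)
    refunded = models.BooleanField(default=False)

With this layout, "items per purchase" reduces to an annotation such as Purchase.objects.annotate(item_count=Count("items")), and a fulfillment export is one Purchase with its prefetched items.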

Reducing the number of calls to MongoDB with mongoengine

I'm working to optimize a Django application that's (mainly) backed by MongoDB. It's dying under load testing. On the current problematic page, New Relic shows over 700 calls to pymongo.collection:Collection.find. Much of the code was written by junior coders, and normally I would look for places to add indices, make smarter joins, and remove loops to reduce query calls, but joins aren't an option here. What I have done (after adding indices based on EXPLAINs) is try to reduce the cost in loops by making a general query and then filtering that smaller set in the loops*. While I've gotten the number down from 900 queries, 700 still seems insane even with the intense amount of work being done on the page. I thought perhaps find was called even when filtering an existing queryset, but the code suggests it's always a database query.
I've added some logging to mongoengine to see where the queries come from and to look at EXPLAIN statements, but I'm not having a ton of luck sifting through the wall of info. mongoengine itself seems to be part of the performance problem: I switched to mongomallard as a test and got a 50% performance improvement on the page. Unfortunately, I got errors on a bunch of other pages (as best I can tell, Mallard doesn't do well when filtering an existing queryset; the error complains about a call to deepcopy that's happening in a generator, which you can't do -- I hit a brick wall there). While Mallard doesn't seem like a workable replacement for us, it does suggest a lot of the processing time is spent converting objects to and from Python in mongoengine.
What can I do to further reduce the calls? Or am I focusing on the wrong thing and should be attacking the problem somewhere else?
EDIT: providing some code/ models
The page in question displays the syllabus for a course, showing all the modules in the course, their lessons and the concepts under the lessons. For each concept, the user's progress in the concept is also shown. So there's a lot of looping to get the hierarchy teased out (and it's not stored according to any of the patterns the Mongo docs suggest).
import uuid
from mongoengine import (Document, EmbeddedDocument, EmbeddedDocumentField,
                         ListField, ReferenceField, StringField, UUIDField)

class CourseVersion(Document):
    ...
    course_instances = ListField(ReferenceField('CourseInstance'))
    courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))

class CoursewareContainer(EmbeddedDocument):
    id = UUIDField(required=True, binary=False, default=uuid.uuid4)
    ....
    courseware_containers = ListField(EmbeddedDocumentField('self'))
    teaching_element_instances = ListField(StringField())
The course's modules, lessons and concepts are stored in courseware_containers; we need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress.
* Just to be clear, I am using a profiler, looking at timings, and doing things The Right Way as best I know, not simply changing things and hoping for the best.
The code sample isn't bad per se, but there are a number of areas that should be considered and may help improve performance.
class CourseVersion(Document):
    ...
    course_instances = ListField(ReferenceField('CourseInstance'))
    courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))

class CoursewareContainer(EmbeddedDocument):
    id = UUIDField(required=True, binary=False, default=uuid.uuid4)
    ....
    courseware_containers = ListField(EmbeddedDocumentField('self'))
    teaching_element_instances = ListField(StringField())
Review
Unbounded lists.
course_instances, courseware_containers, teaching_element_instances
If these fields are unbounded and continuously grow then the document will move on disk as it grows, causing disk contention on heavily loaded systems. There are two patterns to help minimise this:
a) Turn on Power of two sizes. This will cost disk space but should lower the amount of I/O churn as the document grows.
b) Initial padding - custom pad the document on insert so it gets put into a larger extent, and then remove the padding. Really an anti-pattern, but it may give you some mileage.
The final barrier is the maximum document size - 16MB - you can't grow your data bigger than that.
Lists of ReferenceFields - course_instances
MongoDB doesn't have joins, so it costs an extra query to look up a ReferenceField - essentially they are an in-app join. Which isn't bad per se, but it's important to understand the tradeoff. By default mongoengine won't automatically dereference the field; only when you access course_version.course_instances will it do another query and then populate the whole list of references. So it can cost you an extra query - if you don't need the data, then exclude() it from the query to stop any leaking queries.
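For example, a sketch of excluding the reference list on the syllabus page; the query itself is hypothetical:

# Fetch the course version without the ReferenceField list, so touching the
# document can never trigger the dereferencing query for course_instances.
course_version = (CourseVersion.objects
                  .exclude("course_instances")
                  .get(id=version_id))  # version_id assumed to be known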
EmbeddedFields
These fields are part of the document, so there is no query cost for them, other than the wire cost of transmitting and loading the data. As they are part of the document, you don't need select_related to get this data.
teaching_element_instances
Are these a list of ids? They are StringFields in the code sample above. Either way, if you don't need to dereference the whole list, then storing the _ids as StringFields and manually dereferencing may be more efficient if coded correctly - especially if you just need the latest (last?) id.
Model complexity
The CoursewareContainer is complex. For any given CourseVersion you have n CoursewareContainers, which themselves have a list of n containers, and those each have n containers, and so on...
Finding the most recent instances
We need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress.
I'm unsure if there is a single instance you are after, or one per container, or one per course. Either way, the logic for querying the data should be examined. If it's a single instance you are after, then that could be stored against the user so as to simplify the logic of looking it up. If it's per course or container, then to improve performance ensure you minimise the number of queries - if possible, collect all the ids and then issue a single $in query at the end, rather than doing a query per container.
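A sketch of that batching, assuming a hypothetical TeachingElementInstance document that holds the progress data; the recursive helper is also an assumption based on the nested container structure shown above:

# Collect the ids from every (recursively nested) container first...
def collect_ids(containers):
    ids = []
    for container in containers:
        ids.extend(container.teaching_element_instances)
        ids.extend(collect_ids(container.courseware_containers))
    return ids

all_ids = collect_ids(course_version.courseware_containers)

# ...then issue one $in query instead of one query per container.
instances = TeachingElementInstance.objects(id__in=all_ids)
progress_by_id = {str(instance.id): instance for instance in instances}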
Mongoengine costs
Currently, there is a performance cost to loading the data into Mongoengine classes - if you don't need the classes and are happy to work with simple dictionaries then either issue a raw pymongo query or use as_pymongo.
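If the syllabus page only needs the raw hierarchy, a sketch of skipping document construction entirely:

# as_pymongo() returns plain dicts instead of Document instances, avoiding
# the conversion cost; only() limits what comes over the wire.
raw_versions = (CourseVersion.objects
                .only("courseware_containers")
                .as_pymongo())
for doc in raw_versions:
    containers = doc.get("courseware_containers", [])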
Schema design
The schema looks logical enough, but is it suitable for the use case - in essence, is it playing to MongoDB's strengths, or is it putting a relational peg in a document-database-shaped hole? I can't answer that for you, but I do know that the way to the happy path with MongoDB is to design the schema based on its use case. With relational databases, schema design from the outset is simple - you normalise; with document databases, how the data is used is a primary factor.
MongoDB best practices
There are many other best practices, and MongoDB has a guide which might be of interest: MongoDB Operations Best Practices.
Feel free to contact me via the Mongoengine mailing list to discuss further and if needs be discuss in private.
Ross

Design pattern for caching dynamic user content (in django)

On my website I'm going to provide points for some activities, similarly to Stack Overflow. I would like to calculate the value based on many factors, so each computation for each user will take, for instance, 10 SQL queries.
I was thinking about caching it:
in memcache,
in the user's row in the database (so that wherever I need to get the user from the database, I can easily show the points)
Storing it in the database seems easy, but on the other hand it's redundant information, so I decided to ask, since maybe there is an easier and prettier solution that I missed.
I'd highly recommend this app for storing the calculated values in the model: https://github.com/initcrash/django-denorm
Memcache is faster than the db... but if you already have to retrieve the record from the db anyway, having the calculated values cached in the rows you're retrieving (as a 'denormalised' field) is even faster, plus it's persistent.
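A minimal sketch combining the two options, assuming a calculate_points() function and a denormalised points field on the user model (both hypothetical):

from django.core.cache import cache

def user_points(user):
    key = f"user-points-{user.pk}"
    points = cache.get(key)
    if points is None:
        points = calculate_points(user)        # the expensive ~10-query calculation
        cache.set(key, points, timeout=300)
        # keep the denormalised column in step so list views can read it cheaply
        type(user).objects.filter(pk=user.pk).update(points=points)
    return points

django-denorm, linked above, automates the "store it in the row" half of this by keeping such computed fields up to date for you.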