I have a model "Messages" which I use to store messages throughout the site. These are messages in discussions, private messages and probably chat. They are all stored in one table. I wonder if it will be faster if I spread messages among several models and tables. One for chat, one for discussions and so on.
So should I keep all messages in one table/model or create several identical models/tables?
As long as you have an index on your type column and filter on that, it will be about the same speed. When your table gets really big, just shard on the type column and it will be the same performance as doing multiple tables but your app will just see one big table.
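To make the index point concrete, here is a minimal sketch using Python's built-in sqlite3 module instead of the Django ORM. The table name, column names, and index name are assumptions for illustration only; the idea is just that a filter on an indexed type column lets the database use the index.

```python
import sqlite3

# In-memory database standing in for the real one (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, type TEXT NOT NULL, body TEXT NOT NULL)"
)
# The index on the type column is what keeps per-type lookups fast.
conn.execute("CREATE INDEX idx_messages_type ON messages (type)")
conn.executemany(
    "INSERT INTO messages (type, body) VALUES (?, ?)",
    [("chat", "hi"), ("discussion", "first post"),
     ("private", "psst"), ("chat", "bye")],
)


def messages_of_type(message_type):
    """Return the bodies of all messages of one type, oldest first."""
    rows = conn.execute(
        "SELECT body FROM messages WHERE type = ? ORDER BY id",
        (message_type,),
    )
    return [body for (body,) in rows]


# Ask SQLite how it would run the filtered query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT body FROM messages WHERE type = ?",
    ("chat",),
).fetchall()
# The last column of each plan row is the human-readable detail.
plan_text = " ".join(str(row[-1]) for row in plan)
```

Printing plan_text should show whether the planner picked idx_messages_type for the filter. Other databases have their own EXPLAIN output, so treat this only as a way to check the idea.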
One table will be better for search purposes (you can "search" all of the messages at once).
However, multiple tables may be faster.
Why not use model inheritance (what Django calls multi-table inheritance)?
from django.db import models

class MessageBase(models.Model):
    subject = models.CharField(max_length=255)
    test = models.TextField()

class ChatMessage(MessageBase):
    pass
This will create 2 tables, with the table for ChatMessage referring directly to the table for MessageBase. This gives you the best of both worlds: "search" using MessageBase to get messages of any kind, but save, and refer to, all other messages using their specific model class.
(please note, the python here might be slightly wrong, as it hasn't been tested, but I'm sure you get the idea!)
I'm in the process of building a web app that takes user input and stores it for retrieval and data manipulation. There are essentially 100-200 static fields that the user needs to input to create the Company model.
I see how I could break the Company model into multiple 1-to-1 Django models that map back to a Company, such as:
Company General
Company Notes
Company Financials
Company Scores
But why would I not create a single Company model with 200 fields?
Are there noticeable performance tradeoffs when trying to load a Query Set?
In my opinion, it would be wise for your codebase to have multiple models related to each other. This will give you better scalability opportunities and easier navigation to your model fields. Also, when you want to make a custom serializer, or custom views that will deal with some of your fields, but not all, it would be ideal to not have to retrieve 100+ fields every time.
Turns out I wasn't asking the right question. This is the question I should have been asking. It's more a database question than a Django question, I believe: Why use a 1-to-1 relationship in database design?
From the logical standpoint, a 1:1 relationship should always be merged into a single table.
On the other hand, there may be physical considerations for such "vertical partitioning" or "row splitting", especially if you know you'll access some columns more frequently or in different pattern than the others, for example:
You might want to cluster or partition the two "endpoint" tables of a 1:1 relationship differently.
If your DBMS allows it, you might want to put them on different physical disks (e.g. more performance-critical on an SSD and the other on a cheap HDD).
You have measured the effect on caching and you want to make sure the "hot" columns are kept in cache, without "cold" columns "polluting" it.
You need a concurrency behavior (such as locking) that is "narrower" than the whole row. This is highly DBMS-specific.
You need different security on different columns, but your DBMS does not support column-level permissions.
Triggers are typically table-specific. While you can theoretically have just one table and have the trigger ignore the "wrong half" of the row, some databases may impose additional limits on what a trigger can and cannot do. For example, Oracle doesn't let you modify the so called "mutating" table from a row-level trigger - by having separate tables, only one of them may be mutating so you can still modify the other from your trigger (but there are other ways to work-around that).
Databases are very good at manipulating the data, so I wouldn't split the table just for the update performance, unless you have performed the actual benchmarks on representative amounts of data and concluded the performance difference is actually there and significant enough (e.g. to offset the increased need for JOINing).
On the other hand, if you are talking about "1:0 or 1" (and not a true 1:1), this is a different question entirely, deserving a different answer...
I'm building a Django web application, part of it involves an online ordering system for food. I want to make a "receipt" object to save transactions.
My concern, however, is this - let's say I have an object Receipt that relates to Orders which relate to Items, if the items get edited or change over time, it will make the receipts look different down the line. Is there a way to save these at the moment of a transaction?
I am adding "soft deletion" to my models to avoid deletion issues; however, I don't think this would protect against edits.
The only way I can think of to deal with this is to 'materialize' the Receipt. In other words, when a receipt is generated, take the Order and Items information current at that time and write the actual values, not the Order/Items ids, to a receipt table. So for each item, write out the attributes you are interested in recording (description, price, qty, etc.) to the table, instead of just an Items.id that points to a value that may change in the future.
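As a rough illustration of materializing a receipt, here is a sketch in plain Python rather than Django models. The names Item, ReceiptLine, and materialize_receipt are made up for this example; in a real app the receipt lines would be rows in their own table with copied columns.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Item:
    """A live catalogue item whose fields may be edited later."""
    description: str
    price: Decimal


@dataclass(frozen=True)
class ReceiptLine:
    """Values copied at purchase time; holds no reference to Item."""
    description: str
    unit_price: Decimal
    quantity: int

    @property
    def total(self):
        return self.unit_price * self.quantity


def materialize_receipt(order):
    """Copy the current item values for each (item, quantity) pair."""
    return [
        ReceiptLine(item.description, item.price, quantity)
        for item, quantity in order
    ]


burger = Item("Burger", Decimal("8.50"))
receipt = materialize_receipt([(burger, 2)])

# A later edit to the catalogue does not affect the stored receipt.
burger.price = Decimal("9.25")
burger.description = "Deluxe burger"
```

Because the receipt line stores copies of the values, the receipt still shows the original description and price after the item is edited.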
I'm facing a dilemma: I'm creating a new product and I don't want to mess up the way I organise the information in my database.
I have two choices for my models. The first one would be to use foreign keys to link them together.
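from django.db import models

class Page(models.Model):
    data = models.JSONField()

class Image(models.Model):
    # on_delete is required since Django 2.0; CASCADE is assumed here
    page = models.ForeignKey(Page, on_delete=models.CASCADE)
    data = models.JSONField()

class Video(models.Model):
    page = models.ForeignKey(Page, on_delete=models.CASCADE)
    data = models.JSONField()

etc...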
The second is to keep everything in Page's JSONField:
class Page(models.Model):
    data = models.JSONField()  # videos and pictures, etc... are stored here
Is one better than the other, and why? This would be a huge help for the way I organize my databases in the future.
I thought maybe the second option could be slower, since every time something changes the whole JSON would be overwritten, but does it make a huge difference, or is what I am saying false?
A JSONField obfuscates the underlying data, making it difficult to write readable code and fully use Django's built-in ORM, validations and other niceties (ModelForms for example). While it gives flexibility to save anything you want to the db (e.g. no need to migrate the db when adding new fields), it takes away the clarity of explicit fields and makes it easy to introduce errors later on.
For example, if you start saving a new key in your data and then try to access that key in your code, older objects won't have it and you might find your app crashing depending on which object you're accessing. That can't happen if you use a separate field.
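A tiny sketch of that failure mode, using plain dicts in place of the stored JSON. The "caption" key and the helper names are hypothetical.

```python
# Two stored "data" blobs: the older one predates the "caption" key.
old_page_data = {"title": "Home"}
new_page_data = {"title": "About", "caption": "Who we are"}


def caption_unsafe(data):
    # Raises KeyError for objects saved before the key existed.
    return data["caption"]


def caption_safe(data, default=""):
    # Every reader of the JSON has to remember to supply a default.
    return data.get("caption", default)


try:
    caption_unsafe(old_page_data)
    old_object_crashed = False
except KeyError:
    old_object_crashed = True
```

With an explicit model field, the migration that adds the column has to say what older rows contain, so this kind of lookup failure does not come up.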
I would always try to avoid it unless there's no other way.
Typically I use a JSONField in two cases:
To save a response from 3rd party APIs (e.g. as an audit trail)
To save references to archived objects (e.g. when the live products in my db change but I still have orders referencing the product).
If you use PostgreSQL, as a relational database, it's optimised to be super-performant on JOINs so using ForeignKeys is actually a good thing. Use select_related and prefetch_related in your code to optimise the number of queries made, but the queries themselves will scale well even for millions of entries.
Say I have a general website that allows someone to download their feed in a small amount of time. A user can be subscribed to many different pages, and the user's feed must be returned to the user from the server with only N of the most recent posts between all of the pages subscribed to. Originally when a user queried the server for a feed, the algorithm was as follows:
look at all of the pages a user subscribed to
get the N most recent posts from each page
sort all of the posts
return the N most recent posts to the user as their feed
As it turns out, doing this EVERY TIME a user tried to refresh a feed was really slow. Thus, I changed the database to have a table of feedposts, which simply has a foreign key to a user and a foreign key to the post. Every time a page makes a new post, it creates a feed post for each of its subscribing followers. That way, when a user wants their feed, it is already created and does not have to be created upon retrieval.
The way I am doing this is creating far too many rows and simply does not seem scalable. For instance, if a single page makes 1 post & has 1,000,000 followers, we just created 1,000,000 new rows in our feedpost table.
Please help!
How do companies such as Facebook handle this problem? Do they generate the feed upon request? Are my database relationships terrible?
It's not that the original schema itself would be inherently wrong, at least not based on the high-level description you have provided. The slowness stems from the fact that you're not accessing the database in a way relational databases should be accessed.
In general, when querying a relational database, you should use JOINs and in-database ordering where possible, instead of fetching a bunch of data, and then trying to connect related objects and sort them in your code. If you let the database do all this for you, it will be much faster, because it can take advantage of indices, and only access those objects that are actually needed.
As a rule of thumb, if you need to sort the results of a QuerySet in your Python code, or loop through multiple querysets and combine them somehow, you're most likely doing something wrong and you should figure out how to let the database do it for you. Of course, it's not true every single time, but certainly often enough.
Let me try to illustrate with a simple piece of code. Assume you have the following models:
from django.db import models

class Page(models.Model):
    name = models.CharField(max_length=47)
    followers = models.ManyToManyField('auth.User', related_name='followed_pages')

class Post(models.Model):
    title = models.CharField(max_length=147)
    # on_delete is required since Django 2.0; CASCADE is assumed here
    page = models.ForeignKey(Page, related_name='posts', on_delete=models.CASCADE)
    content = models.TextField()
    time_published = models.DateTimeField(auto_now_add=True)
You could, for example, get the list of the last 20 posts posted to pages followed by the currently logged in user with the following single line of code:
latest_posts = Post.objects.filter(page__followers=request.user).order_by('-time_published')[:20]
This runs a single SQL query against your database, which only returns the (up to) 20 results that match, and nothing else. And since you're joining on primary keys of all tables involved, it will conveniently use indices for all joins, making it really fast. In fact, this is exactly the kind of operation relational databases were designed to perform efficiently.
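For readers who want to see roughly what such a query looks like at the SQL level, here is a sketch with Python's built-in sqlite3 module. The table and column names are assumptions and will not match what Django generates exactly; the point is only that the join, the ordering, and the limit all happen inside the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE page (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE page_followers (page_id INTEGER, user_id INTEGER);
    CREATE TABLE post (
        id INTEGER PRIMARY KEY,
        page_id INTEGER,
        title TEXT,
        time_published TEXT
    );
    INSERT INTO page (id, name) VALUES (1, 'one'), (2, 'two'), (3, 'three');
    -- user 1 follows pages 1 and 2, but not page 3
    INSERT INTO page_followers (page_id, user_id) VALUES (1, 1), (2, 1);
    INSERT INTO post (page_id, title, time_published) VALUES
        (1, 'a', '2024-01-01'),
        (2, 'b', '2024-01-03'),
        (3, 'c', '2024-01-04'),
        (1, 'd', '2024-01-02');
    """
)


def latest_posts(user_id, limit):
    """Titles of the newest posts from pages the user follows."""
    rows = conn.execute(
        """
        SELECT post.title
        FROM post
        JOIN page_followers ON page_followers.page_id = post.page_id
        WHERE page_followers.user_id = ?
        ORDER BY post.time_published DESC
        LIMIT ?
        """,
        (user_id, limit),
    )
    return [title for (title,) in rows]
```

Calling latest_posts(1, 2) returns only the two newest titles from the followed pages; the post on page 3 never leaves the database.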
Caching will be the solution here.
You will have to reduce the database reads, which are much slower than cache reads.
You can use something like Redis to cache the posts.
Here is an amazing answer for better understanding
Is Redis just a cache
Each page can be assigned a key, and you can pull all of the posts for that page under that key.
You need not cache everything, just cache the most recent M posts, where M >> N and is large enough to reduce the database calls. If a user requests posts beyond the latest M, they can be fetched directly from the DB.
Now when you have to generate the feed, you can make a DB call to get all of the subscribed pages (or you can put them in the cache as well) and then just get the required number of posts from the cache.
The problem here would be keeping the cache up-to date.
For that you can use something like Django signals. Whenever a new post is added, add it to the cache as well using the signal.
So for each DB write you will have to write to cache as well.
But then you will not have to read from the DB, and as Redis is an in-memory datastore, it is pretty fast compared to standard relational databases.
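Here is a minimal sketch of that scheme in plain Python. A dict of bounded deques stands in for Redis, and the function names and the value of M are assumptions for illustration; in a real setup on_post_created would be called from a signal handler and the structure would live in Redis.

```python
from collections import defaultdict, deque

M = 3  # posts cached per page; hypothetical, should be well above N in practice

# page_id -> newest-first (published_at, post_id) pairs.
# A plain dict stands in for Redis here.
cache = defaultdict(lambda: deque(maxlen=M))


def on_post_created(page_id, post_id, published_at):
    """Call on every DB write so the cache stays up to date."""
    # With maxlen set, appendleft drops the oldest entry once M is reached.
    cache[page_id].appendleft((published_at, post_id))


def build_feed(subscribed_page_ids, n):
    """Merge the cached posts of the subscribed pages, keep the n newest."""
    candidates = [
        entry
        for page_id in subscribed_page_ids
        for entry in cache[page_id]
    ]
    newest_first = sorted(candidates, reverse=True)
    return [post_id for _, post_id in newest_first[:n]]


on_post_created(1, "p1", 1)
on_post_created(1, "p2", 2)
on_post_created(2, "q1", 3)
on_post_created(1, "p3", 4)
on_post_created(1, "p4", 5)
feed = build_feed([1, 2], 2)
```

Only M entries per page are kept, so a page with a million followers still costs one cache write per post rather than a million rows.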
Edit:
These are a few more articles which can help for better understanding
Does Stack Exchange use caching and if so, how
How Twitter Uses Redis to Scale - 105TB RAM, 39MM QPS, 10,000+ Instances
How do people deal with index data (the data usually shown on index pages, like a customer list) -vs- the model detail data?
When somebody goes to the customer/index route -- they only need access to a small subset of the full customer resource. Since I am dealing with legacy data, my customer model has > 10 relationships. It seems wasteful to have the api return a complete and full customer representation for every customer just to render a list/select/index view.
I know those relationships are somewhat lazy-loaded, but it still takes effort on the backend to pull all those relationships in. For some relationships (such as customer->invoices) this could be a large list of ids.
I feel answers to this can be very opinionated. But my two cents:
The API you are drawing on for your data should have an end-point to fetch the subset of data you're interested in, e.g. /api/mini-customer vs /api/customer.
You can then either define two separate models (one to represent the model in the list and one to represent the detailed view), or simply populate the original model with the subset of data and merge the extra data in at a later point.
That said, I've also seen plenty of cases such as the one you describe, where you load all data initially and just display the subset to begin with. If it's reasonable that the data will eventually be used and your page-load constraints can handle it, then this can be an acceptable approach.
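To illustrate the two-representation idea on the API side, here is a small sketch in Python. The Customer fields and the to_summary / to_detail helpers are hypothetical; they only show a list endpoint returning a light payload while the detail endpoint adds the heavier relationships.

```python
from dataclasses import dataclass, field


@dataclass
class Customer:
    id: int
    name: str
    email: str
    invoice_ids: list = field(default_factory=list)


def to_summary(customer):
    """Payload for the list view: no relationships included."""
    return {"id": customer.id, "name": customer.name}


def to_detail(customer):
    """Payload for the detail view: the summary plus the heavier fields."""
    return {
        **to_summary(customer),
        "email": customer.email,
        "invoice_ids": list(customer.invoice_ids),
    }


alice = Customer(1, "Alice", "alice@example.com", [10, 11, 12])
index_payload = [to_summary(alice)]
detail_payload = to_detail(alice)
```

The client can render the index from the summary and merge the detail payload into the same record once the user opens it.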