For a model in my database I need to store around 300 values for a specific field. What would be the drawbacks, in terms of performance and simplicity in query, if I use Postgres-specific ArrayField instead of a separate table with One-to-Many relationship?
If you use an array field
The size of each row in your DB is going to be a bit large thus Postgres is going to be using a lot more toast tables (http://www.postgresql.org/docs/9.5/static/storage-toast.html)
Every time you get the row, unless you specifically use defer (https://docs.djangoproject.com/en/1.9/ref/models/querysets/#defer) the field or otherwise exclude it from the query via only, or values or something, you paying the cost of loading all those values every time you iterate across that row. If that's what you need then so be it.
Filtering based on values in that array, while possible isn't going to be as nice and the Django ORM doesn't make it as obvious as it does for M2M tables.
If you use M2M
You can filter more easily on those related values
Those fields are postponed by default, you can use prefetch_related if you need them and then get fancy if you want only a subset of those values loaded
Total storage in the DB is going to be slightly higher with M2M because of keys, and extra id fields
The cost of the joins in this case is completely negligible because of keys.
Personally I'd say go with the M2M tables, but I don't know your specific application. If you're going to be working with a massive amount of data it's likely worth grabbing a representative dataset and testing both methods with it.
Related
I have multiple small key/value tables in Django, and there value never change
ie: 1->"Active", 2->"Down", 3->"Running"....
and multiple times, I do some get by id and other time by name.
So I'm asking, if it's not more optimize to move them all as Dict (global or in models) ?
thank you
Generally django querysets are slower than dicts, so if you want to write model with one field that has these statuses (active, down, running) it's generally better to use dict until there is need for editability.
Anyway I don't understand this kind of question, the performance benefits are not really high until you got ~10k+ records in single QS, and even by then you can cast the whole model to list by using .values_list syntax. Execution will take approximately part of second.
Also if I understand, these values should be anyway in models.CharField with choices field set, rather than set up by fixture in models.ForeignKey.
When we should define db_index=True on a model fields ?
I'm trying to optimize the application & I want to learn more about db_index, in which conditions we should use it ?
The documentation says that using db_index=True on model fields is used to speed up the lookups with slightly disadvantages with storage and memory.
Should we use db_index=True only on those fields that have unique values like the primary field id ?
What happens if we enabled indexing for those fields which are not unique and contains repetitive data ?
I would say you should use db_index=True when you have a field that is unique for faster lookups.
For example, if you a table customers with many records of users they'll each have their own unique user_id. When you create an index, a pointer is created to where that data is stored within your database so that the next look up against that column will give you a much more desirable time of query than say using their first_name or last_name.
Have a look here to learn more about indexing
You should use db_index=True when you use unique=True, there is a specific reason to use it,
By using this method you can boost a little bit of performance,
When we fire a query in SQL, finding starts from the top to bottom
Case: 'Without db_index=True':
It will search and filter till all bottom rows even if we find the data
Case: 'With db_index=True':
When Object finds it will just stop their
It will boost a little bit of performance
When you set db_index=True on some field, queries based on that field would be much faster O(log(n)) instead O(n).
Under the hood, it is usually implemented using B-Tree.
The trade-off for these accelerated queries is increased memory usage and time for writes. So the best use case for indexing would be if you have a read-heavy database that is often quired by non-primary field.
I'm facing a dilemma, I'm creating a new product and I would not like to mess up the way I organise the informations in my database.
I have these two choices for my models, the first one would be to use foreign keys to link my them together.
Class Page(models.Model):
data = JsonField()
Class Image(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
Class Video(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
etc...
The second is to keep everything in Page's JSONField:
Class Page(models.Model):
data = JsonField() # videos and pictures, etc... are stored here
Is one better than the other and why? This would be a huge help on the way I would organize my databases in the futur.
I thought maybe the second option could be slower since everytime something changes all the json would be overridden, but does it make a huge difference or is what I am saying false?
A JSONField obfuscates the underlying data, making it difficult to write readable code and fully use Django's built-in ORM, validations and other niceties (ModelForms for example). While it gives flexibility to save anything you want to the db (e.g. no need to migrate the db when adding new fields), it takes away the clarity of explicit fields and makes it easy to introduce errors later on.
For example, if you start saving a new key in your data and then try to access that key in your code, older objects won't have it and you might find your app crashing depending on which object you're accessing. That can't happen if you use a separate field.
I would always try to avoid it unless there's no other way.
Typically I use a JSONField in two cases:
To save a response from 3rd party APIs (e.g. as an audit trail)
To save references to archived objects (e.g. when the live products in my db change but I still have orders referencing the product).
If you use PostgreSQL, as a relational database, it's optimised to be super-performant on JOINs so using ForeignKeys is actually a good thing. Use select_related and prefetch_related in your code to optimise the number of queries made, but the queries themselves will scale well even for millions of entries.
I'm learning SQLAlchemy and using a database where multiple table lookups are needed to find a single piece of data.
I'm trying to find the best (most efficient and Pythonic) way to map the multiple lookups to a single SQLAlchemy object or reusable python method.
Ultimately, there will be dozens if not hundreds of mapped object such as these, so something like a .map file might be handy.
I.e. (Using pseudocode)
If I want to find the data 'Status' from 'Patient Name' have to use three tables.
Instead of writing a function for every potential 'this' to 'that' data request, is there an SQLAlchemy or Pythonic way to make the mappings?
I CAN make new, temporary SQLAlchemy Tables to store data. I am NOT at liberty to change the database I'm reading from. I'm hoping to reduce the number of individual calls to the database, because it is remote and slow.
I'm not sure a data join will work, because the Primary Keys, Foreign Keys and Column names are inconsistent in the database. But, I don't really know how to make select-joins in SQLAlchemy.
Perhaps I need to create a new table, with relationships to those three previous tables? But I'm not understanding the relationships well.
Can these tables be auto-generated from a map.ini file?
EDIT:
I might add, that some of these relationships could be one to many. I.e. a patient may be associated with more than one statusID...such as...
That seems simple enough, but all Django Queries seems to be 'SELECT *'
How do I build a query returning only a subset of fields ?
In Django 1.1 onwards, you can use defer('col1', 'col2') to exclude columns from the query, or only('col1', 'col2') to only get a specific set of columns. See the documentation.
values does something slightly different - it only gets the columns you specify, but it returns a list of dictionaries rather than a set of model instances.
Append a .values("column1", "column2", ...) to your query
The accepted answer advising defer and only which the docs discourage in most cases.
only use defer() when you cannot, at queryset load time, determine if you will need the extra fields or not. If you are frequently loading and using a particular subset of your data, the best choice you can make is to normalize your models and put the non-loaded data into a separate model (and database table). If the columns must stay in the one table for some reason, create a model with Meta.managed = False (see the managed attribute documentation) containing just the fields you normally need to load and use that where you might otherwise call defer(). This makes your code more explicit to the reader, is slightly faster and consumes a little less memory in the Python process.