I am wondering what the difference in efficiency is between using JSONFields and a pure SQL approach in my Postgres DB.
I now know that I can query JSONFields like this:
MyModel.objects.filter(json__title="My title")
With a regular column (the pure SQL approach), it would look like this:
MyModel.objects.filter(title="My title")
Are these equal in efficiency?
Having separate columns for each thing is definitely more efficient.
The advantage of a JSONField is flexibility. You can store anything you want in there, and you don't have to change your database schema. But this comes at a cost. If you have a column that is, for example, a CharField with a maximum of 255 characters, then lots of time and effort will have gone into making the database optimise for that particular type (likewise for other types). A JSONField, however, can hold literally anything, so it becomes very difficult to optimise a query for it at the actual database level.
Unless you have a good reason to use a JSON field (namely you need that level of flexibility) it is much much much better to go with separate columns for each of your fields. There are other advantages besides performance as well. You can define defaults, you can know for certain what types different variables are, which will make programming with them a whole heap easier and avoid a load of errors.
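To make the trade-off concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for Postgres (an assumption, made only so the example runs anywhere), and the tables item_typed and item_json are hypothetical. It shows the database filling in a default, rejecting an over-long title and using an index on a typed column, while the JSON blob happily accepts a title of the wrong type.

```python
import json
import sqlite3

# SQLite stands in for Postgres here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item_typed (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL CHECK (length(title) <= 255),
        published INTEGER NOT NULL DEFAULT 0
    );
    CREATE INDEX item_typed_title ON item_typed (title);

    CREATE TABLE item_json (
        id INTEGER PRIMARY KEY,
        data TEXT NOT NULL  -- opaque JSON document
    );
    """
)

# Typed columns: the default is filled in by the database...
conn.execute("INSERT INTO item_typed (title) VALUES (?)", ("My title",))
published = conn.execute("SELECT published FROM item_typed").fetchone()[0]

# ...and a value that breaks the column's rules is rejected.
try:
    conn.execute("INSERT INTO item_typed (title) VALUES (?)", ("x" * 300,))
    too_long_rejected = False
except sqlite3.IntegrityError:
    too_long_rejected = True

# JSON blob: anything that serialises is accepted, even a numeric "title".
conn.execute("INSERT INTO item_json (data) VALUES (?)", (json.dumps({"title": 123}),))
stored = json.loads(conn.execute("SELECT data FROM item_json").fetchone()[0])

# Ask the planner how it would run an equality lookup on the typed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM item_typed WHERE title = ?", ("My title",)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)  # last field is the readable detail
```

Printing plan_text shows whether the lookup goes through item_typed_title. The exact wording of the plan differs between SQLite versions, and Postgres will report its own plans differently.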
In the Django 1.10 documentation for the BinaryField field type, they give a warning about its use:
Abusing BinaryField
Although you might think about storing files in the database, consider that it is bad design in 99% of the cases. This field is not a replacement for proper static files handling.
It does not continue with any justification for this claim. Are there any generalized indicators for what falls in the 99% "bad design" or 1% "not bad design" cases? Does this ring particularly true with Django because it has great static files support?
I consider this premature optimization at best and cargo cult programming at worst.
While it is true that relational database systems aren't optimized for storing large fields (whether binary or text) and some of them treat them specially or at least have some restrictions on their use, most of them handle at least moderately sized binary values (let's say up to a few hundred megabytes) quite well. Storing pictures or PDFs in the database will be less efficient than storing them in the file system, but for 99% of all applications it will be efficient enough.
On the other hand, if you store these files in the file system, you lose several advantages:
Updates will be outside of transactions, so you can't be sure that an update to the file (in the filesystem) and the metadata (in the database) will be atomic.
You lose referential integrity: Your database may refer to files which have been deleted or renamed.
You have two different places where you store your data. This complicates access, backups, etc.
I would try to store together all data that logically belongs together. Usually that means storing everything in the database. If this is not technically possible (e.g. because your files are too big; most RDBMS have a size limit on blobs), or if tests show that it is too slow or otherwise inconvenient, you can always optimize it later.
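The first two points (atomic updates and referential integrity) can be illustrated with a small sketch. It uses the standard-library sqlite3 module rather than Django, and the tables file_blob and file_meta are hypothetical names chosen for the example, so take it as an illustration rather than a recipe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript(
    """
    CREATE TABLE file_blob (id INTEGER PRIMARY KEY, content BLOB NOT NULL);
    CREATE TABLE file_meta (
        blob_id INTEGER NOT NULL REFERENCES file_blob (id),
        name TEXT NOT NULL UNIQUE,
        size INTEGER NOT NULL
    );
    """
)


def store_file(name, content):
    """Insert the bytes and their metadata in one transaction."""
    with conn:  # commits on success, rolls back if anything inside raises
        cur = conn.execute("INSERT INTO file_blob (content) VALUES (?)", (content,))
        conn.execute(
            "INSERT INTO file_meta (blob_id, name, size) VALUES (?, ?, ?)",
            (cur.lastrowid, name, len(content)),
        )


store_file("report.pdf", b"first version")

# The metadata insert fails on the duplicate name, so the blob insert
# from the same call is rolled back as well: no orphaned bytes are left.
try:
    store_file("report.pdf", b"second version")
    duplicate_failed = False
except sqlite3.IntegrityError:
    duplicate_failed = True

blob_count = conn.execute("SELECT COUNT(*) FROM file_blob").fetchone()[0]

# Referential integrity: bytes that metadata still points to cannot be deleted.
try:
    with conn:
        conn.execute("DELETE FROM file_blob")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True
```

With the file on disk and only the metadata in the database, neither guarantee comes for free: the application has to handle the half-written and dangling-reference cases itself.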
Django models are an abstraction over a relational database. Relational databases excel at storing small amounts of data with a well-defined format and well-defined relationships. They are optimised for fixed-length rows and low memory usage.
Is your data fixed length, smaller than 4 KB, and not meant to be served by a web server? Then you are probably in the 1%.
What is the best way to implement functions while writing an app in Django? For example, I'd like to have a function that would read some data from other tables, merge them into a result and update the user's score based on it.
I'm using a PostgreSQL database, so I can implement it as a database function and use Django to call this function directly.
I could also fetch all those values in Python and implement it as a Django function.
Since the model is defined in Django, I feel like I shouldn't define functions in the database itself but rather implement them in Python. Also, if I wanted to recreate the database on another computer, I'd need to hardcode those functions and load them into the database in order to do that.
On the other hand, if the database is on another computer, such a function would need to call the database multiple times.
Which is the preferred option when implementing an app in Django?
Also, how should I handle constraints that I'd like the fields to have? By overriding the save() function, or by adding constraints to the database fields by hand?
This is a classic problem: do it in the code or do it in the DBMS? For me, the answer comes from asking myself this question: is this logic/functionality intrinsic to the data itself, or is it intrinsic to the application?
If it is intrinsic to the data, then you want to avoid doing it in the application. This is particularly true where more than one app is going to be accessing / modifying the data, in which case you may end up implementing the same logic in multiple languages / environments. That situation is rife with ways to screw up, now or in the future.
On the other hand, if this is just one app's way of thinking about the data, but other apps have different views (pun intended), then do it in the app.
BTW, beware of premature optimization. It is good to be aware of DB accesses and their costs, but unless you are talking big data, or a very time sensitive UI, then machine-time, and to a lesser degree user-time, is less important than your time. Getting v1.0 out the door is often more important. As the inimitable Fred Brooks said, "Plan to throw one away; you will anyhow."
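For the score-update example from the question, the two options might look roughly like this. This is a sketch using the standard-library sqlite3 module in place of Postgres and Django, with hypothetical tables app_user and event. It only shows that both routes give the same result and how many statements each one sends.

```python
import sqlite3


def make_db():
    """Build a small in-memory database; all names here are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE app_user (id INTEGER PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE event (user_id INTEGER NOT NULL, points INTEGER NOT NULL);
        INSERT INTO app_user (id) VALUES (1), (2), (3);
        INSERT INTO event (user_id, points) VALUES (1, 10), (1, 5), (2, 7);
        """
    )
    return conn


def update_scores_in_app(conn):
    """App-side logic: read rows, aggregate in Python, write one UPDATE per user."""
    totals = {}
    for user_id, points in conn.execute("SELECT user_id, points FROM event"):
        totals[user_id] = totals.get(user_id, 0) + points
    user_ids = [row[0] for row in conn.execute("SELECT id FROM app_user")]
    with conn:  # one transaction for all the writes
        for user_id in user_ids:
            conn.execute(
                "UPDATE app_user SET score = ? WHERE id = ?",
                (totals.get(user_id, 0), user_id),
            )
    return 2 + len(user_ids)  # statements sent: two SELECTs plus one UPDATE per user


def update_scores_in_db(conn):
    """Database-side logic: a single set-based statement does the same work."""
    with conn:
        conn.execute(
            """
            UPDATE app_user
            SET score = (
                SELECT COALESCE(SUM(points), 0)
                FROM event
                WHERE event.user_id = app_user.id
            )
            """
        )
    return 1  # statements sent


def scores(conn):
    return dict(conn.execute("SELECT id, score FROM app_user ORDER BY id").fetchall())


app_conn, db_conn = make_db(), make_db()
app_statements = update_scores_in_app(app_conn)
db_statements = update_scores_in_db(db_conn)
```

When the database sits on another machine, each statement is a network round trip, which is the cost the question worries about. Whether that cost matters for a given app is exactly the premature-optimization question above.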
I have an SQL JOIN that does not translate well into the Django ORM syntax and I am wondering which alternative solution is better:
1. Use a raw SQL query, or
2. Use two querysets and perform the join in memory to annotate one of them.
I can join the two querysets in linear time, so I don't think 2. would be significantly slower. But which one of the approaches is better conceptually?
The primary considerations are performance, readability/clarity of the code, and maintainability.
If performance is not an issue here, as you say, then clarity and maintainability are the primary factors. Pure Python/Django code is typically easier to read and follow than SQL queries inlined as strings in Python code. In my experience, SQL queries as strings are also harder to maintain: nothing flags them when your model changes and the query is no longer valid, so the issue is only found at runtime.
Ultimately, readability and maintainability are judged by you and any other maintainers, since the code needs to be clear and maintainable to you. In my opinion, the SQL string approach is less clear and more brittle.
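For reference, the linear-time in-memory join mentioned in the question can be sketched as a plain hash join. The lists below stand in for two evaluated querysets (for example the output of .values()). The helper annotate_in_memory and the field names are hypothetical.

```python
def annotate_in_memory(left_rows, right_rows, key, field, default=None):
    """Hash join: index the right side once, then annotate every left row.

    Runs in O(len(left_rows) + len(right_rows)).
    """
    lookup = {row[key]: row[field] for row in right_rows}
    # Build new dicts so the input rows are left untouched.
    return [{**row, field: lookup.get(row[key], default)} for row in left_rows]


# Stand-ins for the rows of two querysets; the field names are hypothetical.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
    {"customer_id": 3, "name": "Linus"},
]
order_totals = [
    {"customer_id": 1, "total": 30},
    {"customer_id": 2, "total": 12},
]

annotated = annotate_in_memory(
    customers, order_totals, key="customer_id", field="total", default=0
)
```

Rows with no match on the right side get the default, which mirrors what a LEFT JOIN with a fallback value would give you.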
I'm writing an app in Django where I'd like to make use of implicit inheritance when using ForeignKeys. As far as I'm concerned the only way to handle this nicely is to use the django_polymorphic library (no single table inheritance in Django, WHY OH WHY??).
I'd like to know about the performance implications of this solution. What kind of joins are performed when doing polymorphic queries? Does it have to hit the database multiple times as compared to regular queries (the infamous N+1 queries problem)? The docs warn that "the type of queries that are performed aren't handled efficiently by the modern RDBMs". However, they don't really say what those queries are. Any statistics or experiences would be really helpful.
EDIT:
Is there any way of retrieving a list of objects, each being an instance of its actual class, with a constant number of queries? I thought this was what the aforementioned library does, but now I'm confused and not that certain anymore.
Django-Typed-Models is an alternative to Django-Polymorphic which takes a simple & clean approach to solving the single table inheritance issue. It works off a 'type' attribute which is added to your model. When you save it, the class is persisted into the 'type' attribute. At query time, the attribute is used to set the class of the resulting object.
It does what you expect query-wise (every object returned from a queryset is the downcasted class) without needing special syntax or the scary volume of code associated with Django-Polymorphic. And no extra database queries.
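The idea of a stored 'type' attribute deciding the class of each result can be sketched in plain Python. This is not Django-Typed-Models' actual code, just an illustration of the mechanism. The classes Animal, Dog and Cat and the list of dicts standing in for a single table are all hypothetical.

```python
class Animal:
    """Base class; all subclasses share one 'table' and differ by a type column."""

    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Animal.registry[cls.__name__] = cls  # remember each subclass by name

    def __init__(self, name):
        self.name = name

    def speak(self):
        return "..."

    def to_row(self):
        # On save, persist the concrete class name next to the data.
        return {"type": type(self).__name__, "name": self.name}

    @classmethod
    def from_row(cls, row):
        # At query time, use the stored type to pick the class of the result.
        target = cls.registry.get(row["type"], cls)
        return target(row["name"])


class Dog(Animal):
    def speak(self):
        return "Woof"


class Cat(Animal):
    def speak(self):
        return "Meow"


# One "table" holds rows for the whole hierarchy.
table = [Dog("Rex").to_row(), Cat("Tom").to_row(), Animal("Generic").to_row()]

# A single pass over the rows yields correctly typed objects, no extra lookups.
animals = [Animal.from_row(row) for row in table]
sounds = [animal.speak() for animal in animals]
```

Because everything needed to pick the class is already in the row, downcasting costs no additional queries, which is the property the answer describes.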
In Django, inherited models are internally represented through a OneToOneField. If you use select_related() in a query, Django will follow a one-to-one relation forwards and backwards and include the referenced table with a join, so you wouldn't need to hit the database twice.
OK, I've dug a little bit further and found this nice passage:
https://github.com/bconstantin/django_polymorphic/blob/master/DOCS.rst#performance-considerations
So happily this library does something reasonably sane. That's good to know.
I have a model, say, Item. I want to store an arbitrary number of attributes on it, like title, description, release_date. And I want them to be not just strings but to have a Python type: string, boolean, datetime etc.
What are my options here? The EAV pattern with a separate name-value table won't work because all values would share the same DB type. JSONField can probably help, but it doesn't know about datetime, for example. I was also looking at PickleField; it fits perfectly, but I'm a bit concerned about performance.
You have a couple of options and none of them are great. Some of them have been discussed before on Stack Overflow.
Firstly, as you suggested, you have the entity-attribute-value design pattern.
You can add DB type checking by having a table for VARCHARs and one for INTs and one for BOOLEANs and so on.
EAV makes selects very painful. You have to query a number of tables to actually get an object and if you have to use values from the EAV table in the lookup you will run into performance issues as the size increases.
In general, however, EAV should really only be used for very sparse data where another option simply does not work.
There is a Django package for this on PyPI, but I haven't used it.
I have seen some pretty large-scale commercial products that use this approach when a lot of flexibility is absolutely required.
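A rough sketch of EAV with one value table per type is shown below. It uses the standard-library sqlite3 module, and the tables item_text, item_int and item_bool are hypothetical. SQLite is loosely typed, so the CHECK constraints stand in for the type checking that real column types would give you in a stricter database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (id INTEGER PRIMARY KEY);
    CREATE TABLE item_text (
        item_id INTEGER NOT NULL REFERENCES item (id),
        name TEXT NOT NULL,
        value TEXT NOT NULL CHECK (typeof(value) = 'text'),
        PRIMARY KEY (item_id, name)
    );
    CREATE TABLE item_int (
        item_id INTEGER NOT NULL REFERENCES item (id),
        name TEXT NOT NULL,
        value INTEGER NOT NULL CHECK (typeof(value) = 'integer'),
        PRIMARY KEY (item_id, name)
    );
    CREATE TABLE item_bool (
        item_id INTEGER NOT NULL REFERENCES item (id),
        name TEXT NOT NULL,
        value INTEGER NOT NULL CHECK (value IN (0, 1)),
        PRIMARY KEY (item_id, name)
    );
    INSERT INTO item (id) VALUES (1);
    """
)

# Exact Python type -> value table. bool is looked up by exact type, so True
# lands in item_bool even though bool is a subclass of int.
VALUE_TABLES = {str: "item_text", int: "item_int", bool: "item_bool"}


def set_attr(item_id, name, value):
    table = VALUE_TABLES[type(value)]  # KeyError for unsupported types
    stored = int(value) if isinstance(value, bool) else value
    with conn:
        conn.execute(
            f"INSERT OR REPLACE INTO {table} (item_id, name, value) VALUES (?, ?, ?)",
            (item_id, name, stored),
        )


def load_item(item_id):
    """Reassemble one object: one query per value table."""
    attrs = {}
    for py_type, table in VALUE_TABLES.items():
        rows = conn.execute(
            f"SELECT name, value FROM {table} WHERE item_id = ?", (item_id,)
        )
        for name, value in rows:
            attrs[name] = py_type(value)
    return attrs


# The database itself rejects a value of the wrong type.
try:
    with conn:
        conn.execute(
            "INSERT INTO item_int (item_id, name, value) VALUES (1, 'pages', 'many')"
        )
    wrong_type_rejected = False
except sqlite3.IntegrityError:
    wrong_type_rejected = True

set_attr(1, "title", "My title")
set_attr(1, "pages", 320)
set_attr(1, "in_stock", True)
item = load_item(1)
```

Note that load_item already needs one query per value table to rebuild a single object, which is the pain point with selects described above.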
A slightly better approach is to have a table whose schema changes and a metadata table that describes that table. For dense data where most items have most of the attributes, this has a lot of advantages over EAV. This approach is sometimes called dynamic tables or dynamic rows.
INSERTs, UPDATEs and DELETEs are much faster since everything is in 1-2 tables.
Type checking and potentially constraints can be added.
However, this approach leaves a very complex database that can be harder to work with.
I don't know of any way for Django to use its ORM with this kind of database, since your models would be changing on the fly.
You are altering your database with ALTER TABLE on the fly. You had better be very careful with your transactions.
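As an illustration of the transaction point, here is a minimal sketch of a dynamic table plus a metadata table, where the column and its metadata row are added together or not at all. It uses the standard-library sqlite3 module, which allows schema changes inside a transaction. The tables item and item_attribute and the helper add_attribute are hypothetical, and other databases may behave differently here.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN / COMMIT / ROLLBACK statements below are fully under our control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript(
    """
    CREATE TABLE item (id INTEGER PRIMARY KEY);
    CREATE TABLE item_attribute (name TEXT PRIMARY KEY, sql_type TEXT NOT NULL);
    """
)

ALLOWED_TYPES = {"TEXT", "INTEGER", "REAL"}


def item_columns():
    # Second field of each PRAGMA table_info row is the column name.
    return [row[1] for row in conn.execute("PRAGMA table_info(item)")]


def add_attribute(name, sql_type):
    """Add a column and its metadata row together, or neither."""
    # Identifiers cannot be bound as parameters, so validate before formatting.
    if not name.isidentifier() or sql_type not in ALLOWED_TYPES:
        raise ValueError("unsupported attribute definition")
    conn.execute("BEGIN")
    try:
        conn.execute(f'ALTER TABLE item ADD COLUMN "{name}" {sql_type}')
        conn.execute(
            "INSERT INTO item_attribute (name, sql_type) VALUES (?, ?)",
            (name, sql_type),
        )
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")


add_attribute("title", "TEXT")

# Simulate a failure halfway through: the metadata row already exists, so the
# INSERT fails after ALTER TABLE has succeeded, and the rollback undoes both.
conn.execute("INSERT INTO item_attribute (name, sql_type) VALUES ('pages', 'INTEGER')")
try:
    add_attribute("pages", "INTEGER")
    failed = False
except sqlite3.IntegrityError:
    failed = True

columns_after = item_columns()
```

Without the explicit transaction, the failed call would leave a pages column in item that the metadata table knows nothing about (or the other way round), which is the kind of drift that makes this design hard to work with.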
A good approach, if you don't need to perform lookups based on these dynamic attributes, is to store the dynamic data in a JSONField or, better yet, a schema-validated XMLField. However, lookups will be painful if you have to look up based on a dynamic attribute that is part of your JSON or XML.
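On the question's concern that JSON doesn't know about datetime: one possible workaround is to tag such values when serialising and turn them back into Python objects when loading. The sketch below uses only the standard library. The "__type__" tag and the helper names are made up for this example, and how you would hook this into a particular JSONField is left open.

```python
import json
from datetime import date, datetime


def encode_value(value):
    """json.dumps 'default' hook: called only for objects json cannot serialise."""
    # datetime is checked first because it is a subclass of date.
    if isinstance(value, datetime):
        return {"__type__": "datetime", "value": value.isoformat()}
    if isinstance(value, date):
        return {"__type__": "date", "value": value.isoformat()}
    raise TypeError(f"cannot serialise {type(value).__name__}")


def decode_value(obj):
    """json.loads 'object_hook': turn tagged dicts back into Python objects."""
    kind = obj.get("__type__")
    if kind == "datetime":
        return datetime.fromisoformat(obj["value"])
    if kind == "date":
        return date.fromisoformat(obj["value"])
    return obj


attributes = {
    "title": "My title",
    "in_stock": True,
    "release_date": datetime(2020, 5, 17, 12, 30),
    "first_seen": date(2019, 1, 2),
}

stored = json.dumps(attributes, default=encode_value)   # what goes into the column
restored = json.loads(stored, object_hook=decode_value)  # what comes back out
```

Strings, booleans and numbers round-trip on their own, so only the types JSON lacks need a tag. The catch remains the one stated above: querying on a tagged value inside the JSON is awkward.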
The best approach depends on how sparse your data is and how you'll be looking up that data. Also, a very good question to ask is whether you absolutely need this flexibility. I've worked on some projects where we decided we needed EAV, but since the project went into production attributes have rarely been added or removed, so we got all the disadvantages and none of the boons.