I'm writing an app in Django where I'd like to make use of implicit inheritance when using ForeignKeys. As far as I can tell, the only way to handle this nicely is to use the django_polymorphic library (no single table inheritance in Django, WHY OH WHY??).
I'd like to know about the performance implications of this solution. What kind of joins are performed when doing polymorphic queries? Does it have to hit the database multiple times compared to regular queries (the infamous N+1 queries problem)? The docs warn that "the type of queries that are performed aren't handled efficiently by the modern RDBMs", but they don't really say what those queries are. Any statistics or experiences would be really helpful.
EDIT:
Is there any way of retrieving a list of objects, each being an instance of its actual class, with a constant number of queries? I thought this was what the aforementioned library does, but now I've gotten confused and I'm not so certain anymore.
Django-Typed-Models is an alternative to Django-Polymorphic which takes a simple and clean approach to solving the single table inheritance issue. It works off a 'type' attribute that is added to your model. When you save an object, its class is persisted into the 'type' attribute. At query time, the attribute is used to set the class of the resulting object.
It does what you expect query-wise (every object returned from a queryset is the downcasted class) without needing special syntax or the scary volume of code associated with Django-Polymorphic. And no extra database queries.
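For reference, a minimal sketch of the pattern (the TypedModel base class is what the package provides; model names here are illustrative):

    from django.db import models
    from typedmodels.models import TypedModel

    class Animal(TypedModel):
        # TypedModel adds a hidden 'type' column that records the
        # concrete class when the row is saved.
        name = models.CharField(max_length=255)

    class Dog(Animal):
        def speak(self):
            return "woof"

    class Cat(Animal):
        def speak(self):
            return "meow"

    # A single query; each row comes back as a Dog or Cat instance:
    for animal in Animal.objects.all():
        print(animal.name, animal.speak())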
In Django, inherited models are internally represented through a OneToOneField. If you use select_related() in a query, Django will follow a one-to-one relation forwards and backwards to include the referenced table with a join, so you wouldn't need to hit the database twice.
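A minimal sketch of what that looks like (model names are illustrative):

    from django.db import models

    class Place(models.Model):
        name = models.CharField(max_length=80)

    class Restaurant(Place):  # Django adds an implicit OneToOneField to Place
        serves_pizza = models.BooleanField(default=False)

    # One joined query instead of an extra query per row:
    for place in Place.objects.select_related('restaurant'):
        if hasattr(place, 'restaurant'):  # not every Place is a Restaurant
            print(place.name, place.restaurant.serves_pizza)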
OK, I've dug a little bit further and found this nice passage:
https://github.com/bconstantin/django_polymorphic/blob/master/DOCS.rst#performance-considerations
So happily this library does something reasonably sane. That's good to know.
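For completeness, the usage that passage covers looks roughly like this (a sketch; the import path is the current django-polymorphic package's, and per my reading of the linked docs a queryset costs the base query plus one extra query per derived class present in the result, not one per row):

    from django.db import models
    from polymorphic.models import PolymorphicModel

    class Project(PolymorphicModel):
        topic = models.CharField(max_length=30)

    class ArtProject(Project):
        artist = models.CharField(max_length=30)

    # Returns Project and ArtProject instances as appropriate:
    projects = Project.objects.all()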
I am wondering what the difference in efficiency is between using JSONFields and plain columns in my Postgres DB.
I now know that I can query JSONFields like this:
MyModel.objects.filter(json__title="My title")
With a plain column, it would look like this:
MyModel.objects.filter(title="My title")
Are these equal in efficiency?
Having separate columns for each thing is definitely more efficient.
The advantage of a JSONField is flexibility: you can store anything you want in there, and you don't have to change your database schema. But this comes at a cost. If you have a column that is, say, a CharField with a maximum of 255 characters, a lot of time and effort has gone into making the database optimise for that particular type (likewise for other types). With a JSONField, however, the value can be literally anything, and it becomes very difficult to optimise a query for it at the actual database level.
Unless you have a good reason to use a JSONField (namely, you need that level of flexibility), it is much, much better to go with separate columns for each of your fields. There are other advantages besides performance as well: you can define defaults, and you know for certain what type each field is, which makes programming with them a whole heap easier and avoids a load of errors.
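To make the comparison concrete, here is a sketch of the two designs (model names are illustrative; indexing the plain column is a one-liner, while speeding up the JSON lookup takes something like a Postgres GIN index):

    from django.db import models

    class Book(models.Model):
        # Typed, constrained, individually indexable:
        title = models.CharField(max_length=255, db_index=True)
        published = models.DateField(null=True)

    class FlexibleBook(models.Model):
        # Anything goes, but the schema knows nothing about the keys:
        data = models.JSONField(default=dict)

    Book.objects.filter(title="My title")                # btree index lookup
    FlexibleBook.objects.filter(data__title="My title")  # key extraction inside JSON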
Is it an issue in any way that a card-many attribute may have a huge number of values? (Say 10^6.)
I think the intuition/concern here is that we definitely don't want the equivalent of a select * on the entity creeping in anywhere.
Of course, the datoms making up a card-many attribute are clearly stored individually, so I suspect this simply isn't an issue. However, I just wanted to be sure.
Coming from SQL, the pivot is clearly a separate table and can be treated as such.
So I guess it depends entirely on the entity/pull APIs that actually assemble the datoms, and/or perhaps on Datalog itself. (Background: I haven't used the Datomic APIs for real yet, though I have done the excellent Learn Datalog tutorial; my current task is translating a few SQL schemas I have into Datomic, to get started.)
I'd like to know whether this is simply not a problem in the pull or entity APIs, i.e. whether such an attribute can easily be excluded from a result set in effectively all cases, or whether there is any drawback of a highly numerous card-many attribute to be aware of.
What is the best way to implement functions when writing an app in Django? For example, I'd like to have a function that reads some data from other tables, merges them into a result, and updates a user's score based on it.
I'm using a PostgreSQL database, so I could implement it as a database function and have Django call that function directly.
I could also fetch all those values in Python and implement it as a Django function.
Since the model is defined in Django, I feel like I shouldn't define functions as part of the database setup but rather implement them in Python. Also, if I wanted to recreate the database on another computer, I'd need to hardcode those functions and load them into the database.
On the other hand, if the database is on another computer, such a Python function would need to call the database multiple times.
Which is the preferred option when implementing an app in Django?
Also, how should I handle the constraints I'd like the fields to have? By overriding the save() method, or by adding constraints to the database fields by hand?
This is a classic problem: do it in the code or do it in the DBMS? For me, the answer comes from asking myself this question: is this logic/functionality intrinsic to the data itself, or is it intrinsic to the application?
If it is intrinsic to the data, then you want to avoid doing it in the application. This is particularly true where more than one app is going to be accessing or modifying the data; in that case you may end up implementing the same logic in multiple languages and environments, a situation rife with ways to screw up, now or in the future.
On the other hand, if this is just one app's way of thinking about the data, but other apps have different views (pun intended), then do it in the app.
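As for the constraints part of the question: Django can declare database-level constraints right in the model definition, so a rule that is intrinsic to the data is enforced by the DB while still living (and migrating) with your Python code. A minimal sketch with illustrative names:

    from django.db import models

    class Score(models.Model):
        value = models.IntegerField()

        class Meta:
            constraints = [
                # Enforced by the database itself, not just in save():
                models.CheckConstraint(
                    check=models.Q(value__gte=0),
                    name="score_value_gte_0",
                ),
            ]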
BTW, beware of premature optimization. It is good to be aware of DB accesses and their costs, but unless you are talking big data or a very time-sensitive UI, machine time (and to a lesser degree user time) is less important than your time. Getting v1.0 out the door is often more important. As the inimitable Fred Brooks said, "Plan to throw one away; you will anyhow."
Specifically thinking of web apps,
(1) Why are relationships (i.e. foreign keys) in an RDBMS even useful?
The web apps I write have built-in logic that validates user input against required fields. I see no real use for foreign keys, and thus no real use for relational databases.
Besides, if I were to put all the required-field validation logic in the RDBMS (e.g. MySQL), it would simply return a vague error. At least with PHP-based validation I know which field is missing and can notify the user (though with JavaScript-based validation this would almost NEVER happen anyway).
(2) Was there a point in the past where RDBMSs were useful for some reason, or is there a reason they are useful now that I'm not aware of?
I really need some insight on this topic. I simply can't come up with a good answer.
I will come at this from a different angle.
I work at a place whose initial records database had no foreign key constraints, default values, or other data checks whatsoever. The lead engineer's excuse was something similar to what you have described above: "The application will ensure the referential integrity."
The problem is, we did not have a standard data layer (like an object-relational mapping) on top of the database. We had multiple programmatic sources feeding into the same tables. It was funny, because after a while you could tell which parts of the code had created which rows in a table. Sometimes the links lined up, sometimes they didn't. Sometimes the links were NULL (when they shouldn't have been), and sometimes they were 0. We even had a few cyclic records, which was fun.
My point is, you never know when you are going to need to write a quick script to batch import records, or write a new subsystem that references the same tables. It behooves us as programmers to program as defensively as possible. We can't assume that those who come after us will know as much (if anything) about how our schema should be used.
I'm not much of an SQL lover, but even I must say that the relational structure has its advantages.
It doesn't only allow validation. By providing the database with metadata describing the relations between the actual pieces of information stored, a great number of optimizations become possible.
This makes it possible to quickly retrieve large, complex datasets. It also reduces the number of queries needed to make modifications and keep the data coherent, since most of the "book-keeping" is carried out automatically on the DB side of the connection.
One incredibly useful feature of foreign keys in most relational databases is cascades.
Suppose you have a families table and a persons table. Each family can have multiple people, but a person can only belong to one family (a one-to-many relationship). If you have foreign keys and you delete a family row, the database can automatically update all the related people, either by deleting them or by setting their foreign keys to NULL.
If you do not have this constraint, you must handle this situation yourself, in your own code.
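In Django terms the example looks like the sketch below; note that Django performs these cascades in the ORM layer by default, while the pure database-level equivalent is ON DELETE CASCADE on the foreign key itself.

    from django.db import models

    class Family(models.Model):
        name = models.CharField(max_length=100)

    class Person(models.Model):
        name = models.CharField(max_length=100)
        # Deleting a Family deletes its Person rows:
        family = models.ForeignKey(Family, on_delete=models.CASCADE)
        # Alternative: keep the people and null out the link instead:
        # family = models.ForeignKey(Family, null=True, on_delete=models.SET_NULL)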
RDBMSs are still very useful. Not sure why you wouldn't think so. Foreign key constraints can be used to maintain referential integrity (in other words, to provide a simple way to express 1:1, 1:many and many:many relationships). RDBMSs are also useful because a rich theory accompanied their practical development, unlike previous DBMSs. In particular, relational calculus/algebra are nice since they allow for good query optimization, normalization, etc.
Not sure if that really answers your question. Wikipedia might list some advantages of RDBMSs.
(1) Why are relationships (i.e. foreign keys) in an RDBMS even useful?
First off, I think you are talking about foreign key CONSTRAINTS. Foreign keys themselves are just a logical design feature that says this entity matches up with that one.
Foreign key constraints are useful because:
They help you adhere to the DRY (don't repeat yourself) principle. Sure, your app validates the relationship, but does it do so in several places? Are there multiple apps that access the same DB? Do you have to repeat the logic in each app? Hey, you could pull that logic out and use a common DLL to access that data and enforce the logic. Better yet, what if that were built into the RDBMS, so I didn't have to write custom code for something so routine? Bam. Foreign key constraints.
If your app enforces the foreign key validations, how do you force users who are working directly in the DB to honor your rules? I know, I know. You shouldn't let users into the back end directly, but just try telling that to the data analysts when they have a project for corporate and you are the bottleneck.
As to the vague error: wouldn't your argument be better stated as "RDBMS X has vague errors when data fails foreign key constraint checks"? The way you have generalized it, you could also argue that we should use paper ledgers instead of computers because one constraint had a vague error.
(2) Was there a point in the past where RDBMSs were useful for some reason, or is there a reason they are useful now that I'm not aware of?
Yeah, that would be now, yesterday and probably long into the future.
I could go on forever about the reasons, but here is the big one...
It provides a common structured file format that is easy to extend and leverage from other applications. You may be too young to remember when every dang system had its own proprietary structured file format, but it sucked. Plus, it forced you to re-invent the wheel constantly in terms of things like indexing, a query language, locking, etc.
"I see no real use for foreign keys and thus no real use for
relational databases"
Judging by this remark, you seem to be underestimating what a relational database is for. Foreign key constraints aren't a defining feature of relational databases and certainly aren't the only reason for using such databases. The relational database model is a powerful and effective way to represent data and it remains so even if you decide you don't want to implement a foreign key constraint. I will therefore assume the question you really meant to ask is: Why are foreign keys useful in relational databases?
A foreign key constraint is just one kind of data integrity constraint. You can of course implement integrity rules outside the database but the DBMS is designed and optimised to do the job for you and is generally the most efficient place to do it because it is closest to the data structures. If you did it outside the database then you would have at least an extra round trip to retrieve the necessary data. You would also have to replicate the DBMS's locking/concurrency model in your application code.
The database optimiser can take advantage of constraints in the database to improve the performance of queries. It can't do that if the rules only exist in your application code.
If you have many applications sharing the same database then implementing data integrity rules in every application is impractical and expensive to maintain. Centralising the constraint logic makes more sense.
Various CASE tools and DBA tools will take advantage of database constraints, can reverse engineer them and use them to assist development and maintenance tasks.
In practice, the meaning and function of a database constraint versus some procedural code that validates data only on entry are very different. If X is implemented as a database constraint, then I know it holds for every piece of data in the database. If X is implemented in the application when data is entered, then I only know it applies to future data; I can't be sure it holds for everything already in the database (maybe X was only implemented today and didn't apply to the data entered yesterday).
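To make that last point concrete with Django (the framework discussed elsewhere on this page): adding a constraint through a migration makes the database validate every existing row, so the guarantee covers yesterday's data too. Model and constraint names here are hypothetical:

    from django.db import migrations, models

    class Migration(migrations.Migration):
        dependencies = [("shop", "0001_initial")]

        operations = [
            migrations.AddConstraint(
                model_name="order",
                constraint=models.CheckConstraint(
                    check=models.Q(quantity__gt=0),
                    name="order_quantity_gt_0",
                ),
            ),
            # If any existing row violates the check, the ALTER TABLE
            # fails and the migration is rolled back.
        ]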
Because they maintain the integrity of the database. If you have all your business logic in the application then in theory they are not needed, but they are still useful as a safeguard against bad data.
I have a model, say, Item. I want to store an arbitrary number of attributes on it, like title, description, release_date. And I want them to be not just strings but to have a Python type: string, boolean, datetime, etc.
What are my options here? The EAV pattern with a separate name-value table won't work because of the single DB type shared across all values. JSONField can probably help, but it doesn't know about datetime, for example. I was also looking at PickleField; it fits perfectly, but I'm a bit concerned about performance.
You have a couple of options and none of them are great. Some of them have been discussed before on Stack Overflow.
Firstly, as you suggested, you have the entity-attribute-value design pattern.
You can add DB-level type checking by having a table for VARCHARs, one for INTs, one for BOOLEANs, and so on.
EAV makes selects very painful: you have to query a number of tables to actually assemble an object, and if you have to use values from the EAV tables in the lookup, you will run into performance issues as the size grows.
In general, however, EAV should really only be used for very sparse data where another option simply does not work.
There is a Django package for this on PyPI, but I haven't used it.
I have seen some pretty large-scale commercial products that use this approach when a lot of flexibility is absolutely required.
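A hand-rolled sketch of the typed-table EAV variant described above (all names are illustrative):

    from django.db import models

    class Item(models.Model):
        name = models.CharField(max_length=100)

    class Attribute(models.Model):
        name = models.CharField(max_length=100)

    # One value table per database type keeps type checking in the DB:
    class IntValue(models.Model):
        item = models.ForeignKey(Item, on_delete=models.CASCADE)
        attribute = models.ForeignKey(Attribute, on_delete=models.CASCADE)
        value = models.IntegerField()

    class TextValue(models.Model):
        item = models.ForeignKey(Item, on_delete=models.CASCADE)
        attribute = models.ForeignKey(Attribute, on_delete=models.CASCADE)
        value = models.TextField()

    # Reassembling a single Item already takes one query per value table:
    ints = IntValue.objects.filter(item_id=1).select_related("attribute")
    texts = TextValue.objects.filter(item_id=1).select_related("attribute")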
A slightly better approach is to have a table whose schema changes and a metadata table that describes that table. For dense data where most items have most of the attributes, this has a lot of advantages over EAV. This approach is sometimes called dynamic tables or dynamic rows.
INSERTs, UPDATEs and DELETEs are much faster since everything is in 1-2 tables
Type checking and potentially constraints can be added
However, this approach leaves a very complex database that can be harder to work with
I don't know of any way Django could use its ORM with this kind of database, since your models would be changing on the fly.
You are altering your database with ALTER TABLE on the fly, so you had better be very careful with your transactions. A rough sketch of the mechanism follows.
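A rough sketch of that mechanism using a raw connection (assumes Postgres, where DDL is transactional; the dynamic_columns metadata table and all names are illustrative, and the identifiers must come from a trusted whitelist since DDL cannot be parameterized):

    from django.db import connection, transaction

    def add_dynamic_column(table, column, pg_type):
        # DDL plus its metadata bookkeeping in one transaction.
        with transaction.atomic():
            with connection.cursor() as cursor:
                # Identifiers cannot be passed as query parameters; only
                # use validated, whitelisted names here.
                cursor.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}" {pg_type}')
                cursor.execute(
                    "INSERT INTO dynamic_columns (table_name, column_name, col_type) "
                    "VALUES (%s, %s, %s)",
                    [table, column, pg_type],
                )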
A good approach, if you don't need to perform lookups based on these dynamic attributes, is to store the dynamic data in a JSONField or, better yet, a schema-validated XMLField. However, lookups will be painful if you have to filter on a dynamic attribute inside your JSON or XML.
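Since JSONField has no native datetime type (one of the original concerns), a common workaround is to store ISO 8601 strings and convert at the edges. A sketch with illustrative names:

    import datetime
    from django.db import models

    class Item(models.Model):
        attrs = models.JSONField(default=dict)

    item = Item(attrs={
        "title": "My title",
        "in_print": True,
        # JSON has no datetime type, so store an ISO 8601 string:
        "release_date": datetime.date(2013, 5, 1).isoformat(),
    })
    release = datetime.date.fromisoformat(item.attrs["release_date"])

    # Lookups on dynamic keys work, but they are the painful-to-optimise case:
    Item.objects.filter(attrs__title="My title")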
The best approach depends on how sparse your data is and how you'll be looking it up. Also, a very good question to ask is whether you absolutely need this flexibility. I've worked on projects where we decided we needed EAV, but since the project went into production attributes have rarely been added and rarely removed, so we got all the disadvantages and none of the boons.