Generally, we recommend minimizing the use of bi-directional
relationships. They can negatively impact on model query performance,
and possibly deliver confusing experiences for your report users.
Link: https://learn.microsoft.com/en-us/power-bi/guidance/relationships-bidirectional-filtering
The documentation clearly says that bi-directional filtering hurts model performance.
Do many-to-many relationships also hurt model performance? The documentation (https://learn.microsoft.com/en-us/power-bi/guidance/relationships-many-to-many) doesn't mention this.
The reason for asking: my understanding was that model performance is based on table expansion, and since a many-to-many relationship doesn't support table expansion, does this imply that it will perform badly?
A bidirectional relationship, on the other hand, doesn't affect table expansion (in an intra-group 1:n relationship), yet it is said to perform badly.
So is table expansion not a factor that affects model performance?
There is an extensive article and associated video on many-to-many performance here: https://www.sqlbi.com/articles/different-options-to-model-many-to-many-relationships-in-power-bi-and-tabular/
Considering two approaches in designing Django models:
1 - Very few models, but each model has a long list of fields.
2 - Many models, each with a significantly shorter list of fields, but employing a multi-level hierarchical one-to-many structure.
Suppose both use Postgres as the database. In general, how would the two approaches compare in database size, and in the latency of retrieving, processing, and updating data?
In short: define models based on the business logic.
People often aim to optimize too early in the development process. As Donald Knuth said:
Premature optimization is the root of all evil.
You should see tables as a storage device for entities. For example, if you are making an e-commerce website, it makes sense to have a model for Product, a model for Order, and a model in between (the junction table of a many-to-many relation between Product and Order) that records how many times the product appears in the order.
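A minimal sketch of that design in Django models (the field names are illustrative, not from the answer):

```python
from django.db import models


class Product(models.Model):
    name = models.CharField(max_length=200)
    price = models.DecimalField(max_digits=10, decimal_places=2)


class Order(models.Model):
    created_at = models.DateTimeField(auto_now_add=True)
    products = models.ManyToManyField(Product, through="OrderLine")


class OrderLine(models.Model):
    # The junction table: records how many times a product
    # appears in an order.
    order = models.ForeignKey(Order, on_delete=models.CASCADE)
    product = models.ForeignKey(Product, on_delete=models.CASCADE)
    quantity = models.PositiveIntegerField(default=1)
```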
By modeling this based on data, and not on a specific use case, it is usually simpler to add new features to your application. Furthermore it makes querying simpler, and therefore often has a positive effect on the overall performance.
Furthermore, it is important to get used to the Django ORM tooling, especially, as @markwalker_ says, the select_related(…) and prefetch_related(…) method calls that load related data in bulk. The number of queries the application makes is often a stronger indicator of how efficiently it will run than the exact queries themselves: if your application makes a lot of queries, even simple ones, the number of roundtrips to the database will slow it down significantly. If there is a bottleneck somewhere, you can run a profiler and try to find the parts of the code that need to be optimized.
There are, for example, the nplusone package [GitHub] and Scout, which can detect N+1 query problems; these can then be resolved with select_related(…) and prefetch_related(…).
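A short sketch of the N+1 pattern and its fix, assuming the OrderLine model sketched earlier:

```python
# N+1: one query for the order lines, plus one query per line
# to fetch its related product.
for line in OrderLine.objects.all():
    print(line.product.name)

# One query with a JOIN: the related Product rows are loaded up front.
for line in OrderLine.objects.select_related("product"):
    print(line.product.name)
```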
When talking about relational databases, it seems that most people refer to the primary and foreign key 'relations' as the reason for the 'relational database' terminology.
This is causing me considerable confusion because the textbook linked below states explicitly "A common misconception is that the name "relational" has to do with relationships between tables (that is, foreign keys). Actually, the true source for the model's name is the mathematical concept relation. A relation in the relational model is what SQL calls a table."
http://www.valorebooks.com/textbooks/training-kit-exam-70-461-querying-microsoft-sql-server-2012-microsoft-press-training-kit-1st-edition/9780735666054#default=buy&utm_source=Froogle&utm_medium=referral&utm_campaign=Froogle&date=11/12/15
Furthermore, the following source explicitly refers to the tables as the relations, not the primary/foreign keys.
https://docs.oracle.com/javase/tutorial/jdbc/overview/database.html
However it seems common knowledge almost anywhere else I look or read that the primary and foreign keys are the relations.
Does anyone have a reason for the inconsistency?
Foreign key constraints are a kind of relation - a subset relation - but these aren't the relations from which the model derives its name. Rather, the relations of the relational model refer to finitary relations. Ted Codd wrote in his 1970 paper A Relational Model of Data for Large Shared Data Banks that "The term relation is used here in its accepted mathematical sense. Given sets S1, S2, ... Sn (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S2, and so on." Thus, he was describing a structure which can be represented by a table, if we follow some rules like ignoring duplicate rows and the order of rows (it's a set, after all).
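Codd's definition can be restated compactly (the notation is mine, not from the answer): a relation $R$ on sets $S_1, \ldots, S_n$ is

$$R \subseteq S_1 \times S_2 \times \cdots \times S_n,$$

i.e. a set of n-tuples whose i-th element is drawn from $S_i$.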
Another common misunderstanding is that foreign key constraints represent relationships between entities. They don't. Relationships are represented as sets/tables of rows of associated values. The keys of two or more entities will be recorded together in a row, whether it's in an "entity table" or a "relationship table". Foreign key constraints only enforce integrity; they don't link entities or tables. Tables can be joined on any predicate function; foreign key constraints play no role there.
Most people learn database concepts from blogs, tutorials and answers ranked by popularity. Most people have never read a decent database book, let alone papers by the inventors and students of the relational model of data. Most programmers and corporations want to get the product released and have little time or appreciation for logic, theory and philosophy. It's an inherently complicated field - see Bill Kent's book Data and Reality for an exploration of this complexity. Thus, most of what you'll find on the internet are half-truths at best as people try to make sense of a difficult topic.
People are familiar with records and pointers, due to their prevalence in mainstream programming languages, and they certainly look and sound a lot like entities and relationships. If entities are represented by tables/records and attributes by fields/columns, then 1-to-1 / 1-to-many relationships between entities must be an association between records/tables, right? It's a simple idea, and that makes it difficult to correct. The popularity of object/relational mapping and object-oriented domain models derives from this simple idea (and from well-spoken and sociable authors, unlike the surly attitudes of some relational proponents) but also further entrenches it.
Peter Chen (author of The Entity-Relationship Model - Toward a Unified View of Data) made some effort to be rigorous, distinguishing "entity relations" and "relationship relations". In his view, entities were real-world concepts which were represented in a database as values and described via the association of values in rows. Relationships between entities were similarly represented by the association of values in rows. The E-R model's distinction between relationships and attributes is somewhat redundant (attributes are just binary relationships), and there's little benefit in distinguishing entity tuples from relationship tuples. In fact, I believe it serves to reinforce the confusion. Its superficial similarity to the older network model helped its adoption but also served to maintain the latter, as developers adopted new terminology while keeping old practices.
Object-role modeling (aka NIAM, by Sjir Nijssen and Terry Halpin) does away with attributes and focuses on domains, roles and relations. It's more elegant than E-R and much closer to a true relational model, but its strengths (logical, comprehensive, a move away from the network model) are also its weaknesses (learning curve, more complicated diagrams, less amenable as a vehicle for familiar techniques).
Ted Codd remarked in the paper mentioned above that "The network model, on the other hand, has spawned a number of confusions, not the least of which is mistaking the derivation of connections for the derivation of relations." This is as true today as it was then. The relational model which he described has since been built on by many others, including Chris Date whose book An Introduction to Database Systems is one of the most comprehensive sources on the topic.
I'm naming all these authors because one more opinion on either side isn't going to clear up your confusion. Rather, go to the sources and study them for yourself. Yes, it's hard work, but your efforts will be repaid in the quality of understanding you'll gain.
Problem domain
I'm working on a rather big application which uses a hierarchical data model. It takes images, extracts image features, and creates analysis objects on top of these. So the basic model is like Object-(1:N)-Image_features-(1:1)-Image. But the same set of images may be used to create multiple analysis objects (with different options).
Then an object and an image can have a lot of other connected objects: e.g. the analysis object can be refined with additional data, or complex conclusions (solutions) can be based on the analysis object and other data.
Current solution
This is a sketch of the solution. Stacks represent sets of objects; arrows represent pointers (i.e. image features link to their images, but not vice versa). Some parts (images, image features, additional data) may be included in multiple analysis objects, because the user wants to run analyses on different sets of objects, combined differently.
Images, features, additional data and analysis objects are stored in global storage (god-object). Solutions are stored inside analysis objects by means of composition (and contain solution features in turn).
All the entities (images, image features, analysis objects, solutions, additional data) are instances of corresponding classes (like IImage, ...). Almost all the parts are optional (i.e., we may want to discard images after we have a solution).
Current solution drawbacks
1. Navigating this structure is painful when you need connections like the dotted one in the sketch. If you have to display an image with a couple of solution features on top, you first have to iterate through the analysis objects to find which of them are based on this image, and then iterate through the solutions to display them.
2. If, to solve 1, you choose to explicitly store the dotted links (i.e. the image class holds pointers to the solution features related to it), you'll spend a lot of effort maintaining the consistency of these pointers and constantly updating the links when something changes.
My idea
I'd like to build a more extensible (2) and flexible (1) data model. The first idea was to use a relational model, separating objects and their relations. And why not use an RDBMS here - SQLite seems an appropriate engine to me. So complex relations would be accessible with simple (left) JOINs on the database (pseudocode: images JOIN images_to_image_features JOIN image_features JOIN image_features_to_objects JOIN objects JOIN solutions JOIN solution_features), and then fetching the actual C++ objects for the solution features from global storage by ID.
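A minimal sketch of that junction-table layout and the "dotted link" query, shown with Python's sqlite3 for brevity (the same schema and SQL would apply from C++ via the SQLite C/C++ API; all table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE images (id INTEGER PRIMARY KEY);
CREATE TABLE image_features (id INTEGER PRIMARY KEY);
CREATE TABLE objects (id INTEGER PRIMARY KEY);
CREATE TABLE solutions (
    id INTEGER PRIMARY KEY,
    object_id INTEGER REFERENCES objects(id)
);
CREATE TABLE solution_features (
    id INTEGER PRIMARY KEY,
    solution_id INTEGER REFERENCES solutions(id)
);
-- Junction tables: the same images/features can belong to many objects.
CREATE TABLE images_to_image_features (
    image_id   INTEGER REFERENCES images(id),
    feature_id INTEGER REFERENCES image_features(id)
);
CREATE TABLE image_features_to_objects (
    feature_id INTEGER REFERENCES image_features(id),
    object_id  INTEGER REFERENCES objects(id)
);
""")

# The "dotted link": all solution features sitting on top of a given
# image, found by joining through the junction tables instead of
# maintaining back-pointers by hand.
rows = con.execute("""
    SELECT DISTINCT sf.id
    FROM images i
    JOIN images_to_image_features itf ON itf.image_id = i.id
    JOIN image_features_to_objects fto ON fto.feature_id = itf.feature_id
    JOIN solutions s ON s.object_id = fto.object_id
    JOIN solution_features sf ON sf.solution_id = s.id
    WHERE i.id = ?
""", (1,)).fetchall()
```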
The question
So my primary question is
Is using an RDBMS an appropriate solution for the problems I described, or is it not worth it and there are better ways to organize the information in my app?
If an RDBMS is OK, I'd appreciate any advice on using an RDBMS and the relational approach to store C++ objects' relationships.
You may want to look at Semantic Web technologies, such as RDF, RDFS and OWL, which provide an alternative, extensible way of modeling the world. There are some open-source triple stores available, and some mainstream RDBMSs also have triple-store capabilities.
In particular, take a look at the University of Manchester's Protégé/OWL tutorial: http://owl.cs.manchester.ac.uk/tutorials/protegeowltutorial/
And if you decide this direction is worth looking at further, I can recommend "Semantic Web for the Working Ontologist".
Just based on the diagram, I would suggest that an RDBMS solution would indeed work. It has been years since I was a developer on an RDBMS (called RDM, of course!), but I was able to renew my knowledge and gain many valuable insights into data structure and layout very similar to what you describe by reading the fabulous book "The Art of SQL" by Stephane Faroult. His book will go a long way toward answering your questions.
I've included a link to it on Amazon, to ensure accuracy: http://www.amazon.com/The-Art-SQL-Stephane-Faroult/dp/0596008945
You will not go wrong by reading it, even if in the end it does not solve your problem fully, because the author does such a great job of breaking down a relation in clear terms and presenting elegant solutions. The book is not a manual for SQL, but an in-depth analysis of how to think about data and how it interrelates. Check it out!
Using an RDBMS to track the links between data can be an efficient way to store and think about the analysis you are seeking, and the links are "soft" -- that is, they go away when the hard objects they link are deleted. This ensures data integrity, and Monsieur Faroult can tell you what to do to ensure that remains true.
I don't recommend an RDBMS, based on your requirement for an extensible and flexible model.
Whenever you change your data model, you will have to change the DB schema, and that can involve more work than a change in code.
Any problems with DB queries are discovered only at runtime. This can make a big difference to the cost of maintenance.
I strongly recommend using standard C++ OO programming with STL.
You can make use of encapsulation to ensure any data change is done properly, with updates to related objects and indexes.
You can use STL to build highly efficient indexes on the data
You can create facades to get you the information easily, rather than having to go to multiple objects/collections. This will be one-time work
You can make unit test cases to ensure correctness (much less complicated compared to unit testing with databases)
You can make use of polymorphism to build different kinds of objects, different types of analysis etc
All very basic points, but I reckon your effort would be best spent improving the current solution rather than looking for a DB-based solution.
http://www.boost.org/doc/libs/1_51_0/libs/multi_index/doc/index.html
"you'll put very much effort maintaining consistency of these pointers
and constantly updating the links when something changes."
With the help of Boost.MultiIndex you can create almost any kind of index on a "table". I think the quoted problem is not so serious, so the original solution is manageable.
Specifically thinking of web apps,
(1) Why are relationships (i.e. foreign keys) in an RDBMS even useful?
The web apps I write have built-in logic that validates user input against required fields. I see no real use for foreign keys and thus no real use for relational databases.
Besides, if I were to put all the required-field validation logic in the RDBMS (i.e. MySQL), it would simply return a vague error. At least with PHP-based validation I know which field is missing and can notify the user (though with JavaScript-based validation this would almost NEVER happen anyway).
(2) Was there a point in the past where RDBMSs were useful for some reason, or is there a reason they are useful now that I'm not aware of?
I really need some insight on this topic. I simply can't come up with a good answer.
I will come at this from a different angle.
I work at a place whose initial records database had no foreign key constraints, default values, or other data checks whatsoever. The lead engineer's excuse was something similar to what you have described above: "The application will ensure the referential integrity."
The problem is, we did not have a standard data layer (like an object relational mapping) over the top of the database. We had multiple programmatic sources that fed into the same tables. It was funny because after a while, you could tell which parts of the code created which rows in the table. Sometimes the links lined up, sometimes they didn't. Sometimes the links were NULL (when they shouldn't be), and sometimes they were 0. We even had a few cyclic records which was fun.
My point is, you never know when you are going to need to write a quick script to batch import records, or write a new subsystem that references the same tables. It behooves us as programmers to program as defensively as possible. We can't assume that those who come after us will know as much (if anything) about how our schema should be used.
I'm not much of an SQL lover, but even I must say that the relational structure has its advantages.
It doesn't only allow validation. By providing the database with metadata describing the relations between the actual pieces of information stored, a great number of optimizations become possible.
This makes it possible to quickly retrieve large, complex datasets. It also reduces the number of queries needed to make modifications and keep the data coherent, since most of the "book-keeping" is carried out automatically on the DB side of the connection.
One incredibly useful feature of foreign keys in most relational databases are cascades.
Suppose you have a families table and a persons table. Each family can have multiple people, but a person can belong to only one family (a one-to-many relationship). If you have foreign keys and you delete a family row, the database can automatically update all the related people, either by deleting them or by setting their foreign keys to null.
If you do not have this constraint, you must handle this situation yourself, in your own code.
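A small runnable sketch of that cascade, using SQLite from Python (table names follow the families/persons example above; SET NULL would be the other option the answer mentions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per connection
con.executescript("""
CREATE TABLE families (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE persons (
    id INTEGER PRIMARY KEY,
    name TEXT,
    family_id INTEGER REFERENCES families(id) ON DELETE CASCADE
);
INSERT INTO families VALUES (1, 'Smith');
INSERT INTO persons VALUES (1, 'Alice', 1), (2, 'Bob', 1);
""")

con.execute("DELETE FROM families WHERE id = 1")
# The cascade removed the related persons automatically:
print(con.execute("SELECT COUNT(*) FROM persons").fetchone()[0])  # prints 0
```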
RDBMSs are still very useful. Not sure why you wouldn't think so. Foreign key constraints can be used to maintain referential integrity (in other words, to provide a simple way to express 1:1, 1:many and many:many relationships). RDBMSs are also useful because there was a rich theory accompanying practical developments, unlike previous DBMSs. In particular, relational calculus/algebra are nice since they allow for good query optimization, normalization, etc.
Not sure if that really answers your question. Wikipedia might list some advantages of RDBMSs.
(1) Why are relationships (i.e. foreign keys) in an RDBMS even useful?
First off, I think you are talking about foreign key CONSTRAINTS. Foreign keys are just a logical design feature that says that this entity matches up with that one.
The reasons foreign key constraints are useful are:
They help you adhere to the DRY (don't repeat yourself) principle. Sure, your app validates the relationship, but does it do it in several places? Are there multiple apps that access the same DB? Do you have to repeat the logic in each app? Hey, you could pull that logic out and use a common DLL for access to that data that enforces that logic. Better yet, what if that were built into the RDBMS so I didn't have to write custom code for something so routine? Bam. Foreign key constraints.
If your app enforces the foreign key validations, how do you force users who are working directly in the DB to honor your rules? I know, I know. You shouldn't let users into the back-end directly, but you just try telling that to the data analysts when they have a project for corporate and you are the bottleneck.
As to the vague error: wouldn't your argument be better stated as "RDBMS X has vague errors when data fails foreign key constraint checks"? The way you have generalized it, you could also argue that we should use paper ledgers instead of computers because the constraint had a vague error.
(2) Was there a point in the past where RDBMSs were useful for some reason, or is there a reason they are useful now that I'm not aware of?
Yeah, that would be now, yesterday and probably long into the future.
I could go on forever about the reasons, but here is the big one...
It provides a common structured file format that is easy to extend and leverage from other applications. You may be too young to remember when every dang system had its own proprietary structured file format, but it sucked. Plus, it forced you to re-invent the wheel constantly for things like indexing, a query language, locking, etc.
"I see no real use for foreign keys and thus no real use for
relational databases"
Judging by this remark, you seem to be underestimating what a relational database is for. Foreign key constraints aren't a defining feature of relational databases and certainly aren't the only reason for using such databases. The relational database model is a powerful and effective way to represent data and it remains so even if you decide you don't want to implement a foreign key constraint. I will therefore assume the question you really meant to ask is: Why are foreign keys useful in relational databases?
A foreign key constraint is just one kind of data integrity constraint. You can of course implement integrity rules outside the database but the DBMS is designed and optimised to do the job for you and is generally the most efficient place to do it because it is closest to the data structures. If you did it outside the database then you would have at least an extra round trip to retrieve the necessary data. You would also have to replicate the DBMS's locking/concurrency model in your application code.
The database optimiser can take advantage of constraints in the database to improve the performance of queries. It can't do that if the rules only exist in your application code.
If you have many applications sharing the same database then implementing data integrity rules in every application is impractical and expensive to maintain. Centralising the constraint logic makes more sense.
Various CASE tools and DBA tools will take advantage of database constraints, can reverse engineer them and use them to assist development and maintenance tasks.
In practice the meaning and function of a database constraint versus some procedural code that validates data only on entry is very different. If X is implemented in a database constraint then I know it is valid for every piece of data in the database. If X is implemented in the application when data is entered then I only know it applies to future data - I can't be sure it applies to everything already in the database (maybe X was only implemented today and didn't apply to the data entered yesterday).
Because they maintain the integrity of the database. If you have all your business logic in the application then in theory they are not needed, but are still useful as a safeguard against bad data.
I'm writing an app in Django where I'd like to make use of implicit inheritance when using ForeignKeys. As far as I can tell, the only way to handle this nicely is to use the django_polymorphic library (no single-table inheritance in Django, WHY OH WHY??).
I'd like to know about the performance implications of this solution. What kind of joins are performed when doing polymorphic queries? Does it have to hit the database multiple times compared to regular queries (the infamous N+1 queries problem)? The docs warn that "the type of queries that are performed aren't handled efficiently by the modern RDBMs", but they don't really say what those queries are. Any statistics or experiences would be really helpful.
EDIT:
Is there any way of retrieving a list of objects, each being an instance of its actual class, with a constant number of queries? I thought this is what the aforementioned library does, but now I've gotten confused and I'm not so certain anymore.
Django-Typed-Models is an alternative to Django-Polymorphic which takes a simple & clean approach to solving the single table inheritance issue. It works off a 'type' attribute which is added to your model. When you save it, the class is persisted into the 'type' attribute. At query time, the attribute is used to set the class of the resulting object.
It does what you expect query-wise (every object returned from a queryset is the downcasted class) without needing special syntax or the scary volume of code associated with Django-Polymorphic. And no extra database queries.
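A hedged sketch of what that looks like (class names are illustrative, and you should double-check the import path against the project's current README):

```python
from django.db import models
from typedmodels.models import TypedModel


class Animal(TypedModel):
    name = models.CharField(max_length=100)


class Dog(Animal):
    def speak(self):
        return "woof"


class Cat(Animal):
    def speak(self):
        return "meow"


# All rows share one table with a 'type' column; a single query returns
# already-downcast instances, e.g.:
#   Animal.objects.all()  ->  [<Dog: ...>, <Cat: ...>]
```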
In Django, inherited models are internally represented through a OneToOneField. If you are using select_related() in a query, Django will follow a one-to-one relation forwards and backwards to include the referenced table with a join; so you wouldn't need to hit the database twice if you are using select_related.
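For instance, a sketch assuming standard multi-table inheritance (model names are illustrative):

```python
from django.db import models


class Place(models.Model):
    name = models.CharField(max_length=100)


class Restaurant(Place):  # implicit parent-link OneToOneField to Place
    serves_pizza = models.BooleanField(default=False)


# One query with a JOIN, following the implicit one-to-one backwards,
# instead of one extra query per row:
places = Place.objects.select_related("restaurant")
```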
OK, I've dug a little bit further and found this nice passage:
https://github.com/bconstantin/django_polymorphic/blob/master/DOCS.rst#performance-considerations
So happily this library does something reasonably sane. That's good to know.