Broken referential integrity: What would Edgar Codd say? - referential-integrity

I'm trying to understand the rules of the relational model as originally defined by Edgar Codd in 1970.
Specifically, I'm interested in whether referential integrity is part of his relational model or not. I'll try to demonstrate with the following example (just to make this question pretty):
Customers
+------+---------+
| Name | Address |
+------+---------+
| John | ....    |
| Mike | ....    |
| Kate | ....    |
+------+---------+
Invoices
+----+----------+
| ID | Customer |
+----+----------+
| 1  | John     |
| 2  | John     |
| 3  | Mary     |
+----+----------+
Now, obviously as you can see, we have one invoice where customer (foreign key) is Mary. Would this violate his relational model? Would Edgar Codd look at this and say, gee, what the heck? Or would he say, it's perfectly fine...
This is a theoretical question.

If there is no customer named Mary in the Customers table, then there is no referential integrity between the tables. Specifically, a foreign key refers to a non-existent primary key.
Does this break the relational model? No. The relational model anticipates this situation (a lack of referential integrity) and treats it as an indication that there is a problem with the underlying data.
From "A Relational Model of Data for Large Shared Data Banks" by Edgar Codd (from Communications of the ACM, Volume 13, Number 6, June 1970):
It could be the case that the user intended to insert some other
element into P - an element whose insertion would transform a
consistent state into a consistent state. The point is that the system
will normally have no way of resolving this question without
interrogating its environment (perhaps the user who created the
inconsistency).
So, it is assumed that there will be referential integrity issues and that they will need to be resolved by the user or the system via some programmatic method.
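To make the dangling reference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names mirror the example in the question; the choice of SQLite and the declared foreign key are assumptions for illustration, not something the question or Codd's paper specifies. With the constraint declared and enforced, the DBMS itself refuses the invoice for Mary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (name TEXT PRIMARY KEY, address TEXT)")
conn.execute(
    "CREATE TABLE invoices ("
    " id INTEGER PRIMARY KEY,"
    " customer TEXT NOT NULL REFERENCES customers (name))"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("John", "...."), ("Mike", "...."), ("Kate", "....")],
)
conn.executemany("INSERT INTO invoices VALUES (?, ?)", [(1, "John"), (2, "John")])

try:
    # There is no customer named Mary, so this row has nothing to refer to.
    conn.execute("INSERT INTO invoices VALUES (3, 'Mary')")
    mary_rejected = False
except sqlite3.IntegrityError as exc:
    mary_rejected = True
    print("rejected:", exc)

invoice_count = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
```

Resolving the rejection (adding Mary as a customer, or correcting the invoice) is then up to the user or the application, which is the point the quoted passage makes.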

For a language to be considered relationally complete (a phrase coined by Codd) it must support a set of relational operators, known as a relational algebra. Note there is no one true relational algebra: Codd proposed the first one but others have since refined and built upon Codd's (e.g. The Third Manifesto) and I'm sure he would see this as right and proper.
Referential integrity is not a relational operator and therefore is not a requirement for relational completeness of a language. Whether referential integrity constraints are a useful or necessary feature of a DBMS is another matter.

The Relational Model doesn't require referential integrity features to apply to every relational database - that would be absurd if such constraints weren't relevant or desired. Think of a club membership list consisting of name, address and membership number. There wouldn't necessarily be any use for RI constraints there, but it's still a relational database if the data is stored in the form of a relation.
Even Codd's 13 rules don't require that a RDBMS has to support the ability to create RI constraints. It's just that foreign keys are so useful that most RDBMSs are expected to have them.

I read the following as clearly stating that referential integrity is included in the relational model:
Two integrity rules apply to every relational database:
1 Entity integrity: No mark of either type is permitted in any attribute which is a component of the primary key of a base relation
2 Referential integrity: Let D be a domain from which one or more single-attribute primary keys draw their values. Let K be a foreign key which draws its values from domain D. Every unmarked value which occurs in K must also exist in the database as a value in the primary key of some base relation.
"Missing information (applicable and inapplicable) in relational databases," E. F. Codd, ACM SIGMOD Record, vol. 15, no. 4, pp. 53-78, 1986.
By "mark of either type" he is referring to an unknown value, for which we use NULL today. This paper suggested two different types of unknown values, one for "applicable but missing," and one for "inapplicable."
By "unmarked" he means not NULL.
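Read this way, the referential integrity rule boils down to a simple check. The sketch below is only an illustration: the function name `referential_violations` is made up, and Python's `None` stands in for a marked (NULL) value without distinguishing Codd's two kinds of mark.

```python
def referential_violations(foreign_key_values, primary_key_values):
    """Return the unmarked foreign key values that match no primary key value.

    None represents a marked (missing) value and is exempt from the rule.
    """
    known = set(primary_key_values)
    return [
        value
        for value in foreign_key_values
        if value is not None and value not in known
    ]


# Invoices.Customer checked against Customers.Name from the question,
# plus one marked value to show that it is not reported.
violations = referential_violations(
    ["John", "John", "Mary", None],
    ["John", "Mike", "Kate"],
)
print(violations)
```

Only the unmarked value "Mary" is reported; the marked value is skipped, which is exactly the exemption that the "unmarked" wording introduces.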
Re comment from @dportas: Indeed, you don't even need the referenced relation to be empty to make your argument. It can contain some rows, but since the A-mark in K cannot be said to be equal to any value that exists in that referenced relation, there's no way to say that the hypothetical missing value satisfies the constraint. Therefore allowing an A-mark must become an act of faith that once a value is supplied, it will satisfy the constraint, because otherwise the row would have been invalid from the moment it was inserted, and we'd have to support the concept of a retroactive constraint violation, which is senseless.

First you ask is RI part of the RM:
whether referential integrity is part of his relational model or not
Yes. From Codd's classic "Is your DBMS really relational?" Computerworld, October 14, 1985:
It is, however, vitally important to remember that the relational model includes three major parts: the structural part, the manipulative part and the integrity part -- a fact that is frequently and conveniently forgotten.
Rule 10: Integrity constraints specific to a particular relational data base must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.
But then you paraphrase by a different and ambiguous question:
we have one invoice where customer (foreign key) is Mary. Would this violate his relational model?
If you mean: Does the RM allow a declared FK to be violated, ie not stopped by the DBMS?
No. That would be a DBMS that is letting you declare a FK constraint but isn't enforcing it. Such a DBMS is non-relational in that respect.
If you mean: Does the RM allow a business rule that says an Invoices Customer must also appear in Customers Name (ie that all valid database states are like that, ie that there is a FK constraint from Invoices Customer to Customers Name) to be not declared to the DBMS (eg via a FK declaration)?
Yes. But that's a bad design because it allows some invalid states.
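As a sketch of what "allows some invalid states" means in practice, the snippet below builds the two tables from the question without declaring the FK, so the row for Mary is accepted. The anti-join used to find such rows afterwards is one possible way to audit the undeclared rule; SQLite and the query itself are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT PRIMARY KEY, address TEXT)")
# The business rule exists, but no FK is declared, so nothing enforces it.
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("John", "...."), ("Mike", "...."), ("Kate", "....")],
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?)",
    [(1, "John"), (2, "John"), (3, "Mary")],
)

# Invoices whose customer matches no row in customers.
orphans = conn.execute(
    "SELECT i.id, i.customer FROM invoices AS i"
    " LEFT JOIN customers AS c ON c.name = i.customer"
    " WHERE i.customer IS NOT NULL AND c.name IS NULL"
    " ORDER BY i.id"
).fetchall()
print(orphans)
```

Declaring the constraint moves this check from an after-the-fact audit into the DBMS, which is what Rule 10 asks for.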

I think that whether this is fine or not depends on your design.
An invoice should contain the data as it was at the moment the invoice was created or sent. As such it would appear to need data that is related to customer data but not directly a foreign key especially if you are using a natural key.
For instance, suppose Mary Jones ordered something and was invoiced on May 31, 2010. On Sept 12, 2010, she changed her name to Mary Jones-Smith and moved to her husband's address. The invoice, being a picture in time, should retain the name Mary Jones and the original address it was sent to. It is best if it can also retain a link to the current customer and her information (which is why I would have a customer ID in the customer table, since names change, and an FK of CustomerID in the invoice table). But storing Mary Jones when Mary Jones no longer exists in the customer table is not only OK, it is necessary to have a trail of what actually happened.
The same goes for products, prices and invoices. You would not want the invoice to reflect the price now, but the price at the time of the invoice, even if that doesn't directly relate to what is there now. In this case the Product table might be more of a lookup table than a true parent-child relationship. If you store all the details of the product in the invoice detail table, then you don't need a foreign key to products; you only need the table to look up active products at the time the order is placed. In fact, the model number on a past invoice may well no longer be in the products table if the vendor changed it or dropped the product entirely. But you wouldn't want to lose the data about which of those products were bought in the past.
On the other hand if the relationship requires the data to stay consistent with the current values, a formal foreign key is the best method.
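A minimal sketch of the snapshot-plus-link design described above, using sqlite3. The column names (`customer_id`, `billed_name`, `billed_address`) and the two addresses are invented for illustration; only the names and dates come from the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE customers ("
    " customer_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " address TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE invoices ("
    " invoice_id INTEGER PRIMARY KEY,"
    # Link to the current customer record (a surrogate key, since names change).
    " customer_id INTEGER NOT NULL REFERENCES customers (customer_id),"
    # Snapshot of the customer data as it was when the invoice was issued.
    " billed_name TEXT NOT NULL,"
    " billed_address TEXT NOT NULL,"
    " invoice_date TEXT NOT NULL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Mary Jones', '1 Old Street')")
conn.execute(
    "INSERT INTO invoices"
    " SELECT 1, customer_id, name, address, '2010-05-31'"
    " FROM customers WHERE customer_id = 1"
)

# Later the customer changes her name and address; the invoice is untouched.
conn.execute(
    "UPDATE customers SET name = 'Mary Jones-Smith', address = '2 New Road'"
    " WHERE customer_id = 1"
)
snapshot_and_current = conn.execute(
    "SELECT i.billed_name, c.name FROM invoices AS i"
    " JOIN customers AS c ON c.customer_id = i.customer_id"
).fetchone()
print(snapshot_and_current)
```

The foreign key on `customer_id` stays valid throughout, while the name on the invoice is deliberately not a foreign key at all.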


modeling datawarehouse multilanguage

I need your help.
I work for a survey company and I am responsible for creating its architecture and modeling a data warehouse that analyzes the results of an international survey (50 countries).
For the architecture, we decided to create a tabular model in PowerBI to analyze our data and to create our reports.
Here below is the model as I thought:
However, I have a design problem.
Since the survey is international, the wording of my dimensions differs from country to country.
My 1st question:
-Would it make more sense to create only one PowerBI embedded model for all countries or 50 PowerBI reports?
My 2nd question:
My model must be multilingual
With my 50 countries, I have several languages (5 languages) and for the same language, I have several variants.
The British English labels differ from the US English labels.
For example, for the Response dimension for France the IdReponse = 1 has the wording 'Vrai' while for the USA the wording is 'True' and for the Britain is 'OK'.
Do you know how to model multi language in a data warehouse?
About question #1 - It's always better if there is only one model. It will be much easier to maintain. It isn't clear from your question whether these 50 reports will show the same data (excluding the internationalization of texts like Vrai/True/OK), or whether each report/country should show its own subset of the data. If all reports show the same data, then it is definitely better to make one common model that all reports use. You can do this with Power BI by making one "master" report and publishing it, and then having the rest of your "per country" reports use it as a data source. You will still need separate reports per country, because you will need to translate the texts (column names, static texts, etc.).
About question #2 - You can create lookup tables in your model (maybe even in the database, it's up to you). The key value (1) will be linked to the key of the table, and there will be columns per language. Depending on the language of the current report, you will select the appropriate column (e.g. French, British, etc.) and even you can fallback to let's say US English, in case there is no translation entered for the current language (e.g. by making a computed column). It is also an option to make separate lookup table per language, but I think it will be more cumbersome to maintain this way.
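The lookup-with-fallback idea can be sketched in a few lines of Python. This is not Power BI code; the locale codes, the dictionary layout and the function name `response_label` are assumptions, and in the actual model the same logic would live in a lookup table plus a computed column.

```python
# One entry per key value, one "column" per language variant.
RESPONSE_LABELS = {
    1: {"en-US": "True", "en-GB": "OK", "fr-FR": "Vrai"},
}


def response_label(id_reponse, locale, fallback="en-US"):
    """Return the label for a response in the given locale.

    Falls back to the US English label when no translation was entered.
    """
    translations = RESPONSE_LABELS[id_reponse]
    return translations.get(locale) or translations[fallback]


print(response_label(1, "fr-FR"))
print(response_label(1, "de-DE"))  # no German label entered, uses the fallback
```

Keeping all variants in one structure keyed by the response ID is what makes the fallback a one-line rule; with a separate lookup table per language, the fallback has to reach across tables.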
About question #1: Yes you need only one data model.
About question #2: Load each question in the language it was asked in, and store the response as received in the response DIM. Then create a new column in your response DIM, such as Clean_response, holding the original response transformed to a uniform value. For example, "Vrai", "OK" and "True" have the same meaning, so you may choose to put "Yes" in the Clean_response column. You can likewise convert variations such as "No", "Nada", "noops" and "nah" to a clean value of "No", but keep the original value too.
Labeling a column in the report should be handled in the report code. For example, a report written in French should take your dim column name "Question" and show it as "interroger" as a heading on the report.
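A small sketch of the Clean_response step, assuming the cleaning happens in Python during the load. The mapping reuses the example values above; the row layout and the name `add_clean_response` are hypothetical.

```python
# Raw responses (lower-cased) mapped to one uniform value.
CLEAN_RESPONSE = {
    "vrai": "Yes",
    "ok": "Yes",
    "true": "Yes",
    "no": "No",
    "nada": "No",
    "noops": "No",
    "nah": "No",
}


def add_clean_response(row):
    """Return a copy of the row with a Clean_response column added.

    The original response is kept as-is; unknown responses map to None
    so they can be reviewed instead of being guessed at.
    """
    cleaned = CLEAN_RESPONSE.get(row["response"].strip().lower())
    return {**row, "Clean_response": cleaned}


print(add_clean_response({"country": "France", "response": "Vrai"}))
```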

Modeling Data - Invoices and Line Items

I'm creating a web based point of sale (think cash register) solution with Django as the backend. I've always taken the 'classic' approach of modeling invoices and their line items.
InvoiceTable
id
date
customer
salesperson
discount
shipping
subtotal
tax
grand_total
[...]
InvoiceLineItems
invoice_id // foreign key
product_id
unit_price
qty
item_discount
extended_price
[...]
After attempting to research best practices, I've found that there aren't many - at least no definitive source that's widely used.
The Kimball Group suggests: "Rather than holding onto the operational notion of a transaction header “object,” we recommend that you bring all the dimensionality of the header down to the line items."
See http://www.kimballgroup.com/2007/10/02/design-tip-95-patterns-to-avoid-when-modeling-headerline-item-transactions/ and http://www.kimballgroup.com/2001/07/01/design-tip-25-designing-dimensional-models-for-parent-child-applications/.
I'm new to development (only having used desktop database software before) - but from my understanding this makes sense as we can drill down the data any way we want for reporting purposes (though I imagine we could do the same with the first method by joining the tables).
My Questions
The invoice ID will need to be repeated for each row (so we can generate data like totals for the invoice). Is this an intentional feature of this way of modeling the data?
We often have invoice level data like notes, discounts, shipping charges, etc. - How do we represent these using this method? Some discounts are product specific - so they belong on the line item anyway, others are invoice wide (think of a deal where you buy two separate products and receive a discount on the two) - could we somehow allocate it across the line items? Same with shipping charges, allocate it by dividing it among the line items?
What do we do with invoice 'notes' - we have printed and/or internal notes, would we put the data in the line items and just repeat it for each line item? That seems to go against data normalization. Put it in a related table?
Any open source projects that use this method that I could take a look at? Not sure how to search for them.
It sounds like you're confusing relational design and dimensional design.
A relational design is for facilitating transaction processing, and minimizing data anomalies and duplication. It's your operational database. A dimensional design is for facilitating analysis.
A relational design will have an invoices table and a line_items table and a dimensional design will have a company_invoices_customer fact table with a grain of invoice line item.
Since this is for POS, I assume you want a relational design first.
As for your questions:
First there are tons of good data modelling patterns for this scenario. See https://dba.stackexchange.com/questions/12991/ready-to-use-database-models-example/23831#23831
The invoice ID will need to be repeated for each row (so we can
generate data like totals for the invoice). Is this an intentional
feature of this way of modeling the data?
Yes
We often have invoice level data like notes, discounts, shipping
charges, etc. - How do we represent these using this method?
Probably easiest/simplest to have a "notes" field on the invoice table.
For charges and discounts you should use abstraction (see Table Inheritance), and add them as Order Adjustments. See the book by Silverston in the link above.
Some discounts are product specific - so they belong on the line item
anyway, others are invoice wide (think of a deal where you buy two
separate products and receive a discount on the two) - could we
somehow allocate it across the line items?
The price of the item should be calculated at runtime based on its default price and any discounts or charges that apply in the current "scenario", for example a discount for government customers, for nearby customers, or on a sale day. You could have hierarchical line items that reference each other, to keep things in order. Again, see the Silverston book.
What do we do with invoice 'notes' - we have printed and/or internal
notes, would we put the data in the line items and just repeat it for
each line item?
If you want line item notes, add a notes column on the line items table.
That seems to go against data normalization. Put it in a related
table?
If notes are nullable, and you want to be strict about normalization, then yes, add an invoice_notes table.
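For reference, here is a rough sketch of the relational layout this answer recommends, using sqlite3 rather than Django models so that it runs on its own. The sample rows, and storing prices as integer cents, are assumptions. It shows why the invoice ID is repeated on every line item: the totals are derived by grouping on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE invoices ("
    " id INTEGER PRIMARY KEY,"
    " date TEXT NOT NULL,"
    " notes TEXT)"  # invoice-level notes live on the invoice row
)
conn.execute(
    "CREATE TABLE line_items ("
    " invoice_id INTEGER NOT NULL REFERENCES invoices (id),"
    " product_id INTEGER NOT NULL,"
    " unit_price_cents INTEGER NOT NULL,"
    " qty INTEGER NOT NULL,"
    " notes TEXT)"  # line-item notes, if wanted
)
conn.execute("INSERT INTO invoices VALUES (1, '2024-01-15', 'Leave at front desk')")
conn.executemany(
    "INSERT INTO line_items VALUES (?, ?, ?, ?, ?)",
    [(1, 100, 1999, 2, None), (1, 200, 500, 1, "gift wrap")],
)

# The repeated invoice_id is what lets us roll line items up per invoice.
subtotals = conn.execute(
    "SELECT invoice_id, SUM(unit_price_cents * qty)"
    " FROM line_items GROUP BY invoice_id"
).fetchall()
print(subtotals)
```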

Zend2 Doctrine2 One-To-Many uni-directional with join table, delete cascade issue

I'm having some problems with the following...
I have a table with phone numbers. I want to use the same table for both users and companies. A user can have several phone numbers and a company too. So i want a One to many unidirectional relationship with two different join tables. One linking phone numbers to users, the other linking phone numbers to companies.
This solution follows chapter 5.9 of the Doctrine2 manual.
My users entity holds this code:
/**
 * @ORM\ManyToMany(targetEntity="Application\Entity\PhoneNumber")
 * @ORM\JoinTable(name="user_phone_number_linker",
 *     joinColumns={@ORM\JoinColumn(name="user_id", referencedColumnName="id")},
 *     inverseJoinColumns={@ORM\JoinColumn(name="phone_number_id", referencedColumnName="id")}
 * )
 */
protected $phone_numbers;
I use a unidirectional one-to-many because a bidirectional one would require the phone number entity to refer back to the user, and then I could not reuse the same phone number entity class for the company. It all works fine, but when I delete a phone number I get the following error:
An exception occurred while executing 'DELETE FROM phone_number WHERE id = ?' with params {"1":1}:
SQLSTATE[23000]: Integrity constraint violation: 1451 Cannot delete or update a parent row: a foreign key constraint fails (database/user_phone_number_linker, CONSTRAINT user_phone_number_linker_ibfk_11 FOREIGN KEY (phone_number_id) REFERENCES phone_number (id))
If I set the ON DELETE CASCADE value manually in the database it works fine, but this is not the idea of using doctrine2 and I think I should be able to solve it within the code without going to my phpMyAdmin panel. Somehow the cascading from the phone number towards the join table should be initiated on deletion, but without making a reference back to the join table from the phone_number entity.
Hope someone smart can help me solve this.
EDIT
In the meantime I learned a lot more about Doctrine2 and reviewing my old question made me realize that this is not a correct way to store several phoneNumbers in one table in the first place. To be able to store user phone numbers and company phone numbers in the same table I should use table inheritance with a discriminator column. The column should hold some user/company discriminator.
Because of this column the doctrine ORM will "know" if that phoneNumber is actually a user or a company phone-number. I need to make two different entity definitions following the single table inheritance mapping principles from the doctrine 2 specs.
One class UserPhoneNumber will have a many-to-one relationship with User the other called CompanyPhoneNumber a one-to-many relationship with Company. I don't necessarily need a join column, the user_id or company_id columns can be in the phone-number table. In the User class the Company association is omitted and in the Company class the User association is omitted (database should allow null values for those columns).
If I do use a join table it is according to the one-to-many unidirectional with join table description in the Doctrine2 specs
READ MORE
Otherwise you can also read more on associations and cascade issues here on this elaborate Doctrine2 in depth website.
As you said, your relation is unidirectional. You've defined a relation from Users to PhoneNumbers. The cascade delete will work when you delete a User, it will remove all rows in user_phone_number_linker because that's the relation you've defined.
If you want to do it the other way, you've got to create a relation from PhoneNumbers to Users. Doctrine needs it to work for you. But you have the problem that the entity is shared by two other entities, Users and Companies.
Keep in mind that entities are objects, not tables. So you could try to create two entities to the same table, one named PhoneNumberUsers and the other PhoneNumberCompanies. This way you'll be able to create the needed relation to do the cascade delete. I haven't tested by myself, but I think it could work.
By the way, you can remove the oncascade parameter on the Users' entity join table. I've the same scenario as you with users and roles, and I haven't used it. I think it's only needed when you want to cascade from entity to entity. I'm not sure about that, but that's what I've been experiencing until now.
My bad,
The phone number/user relationship is regarded as a Many-To-Many relationship, so if I want to remove the phone number I should not only remove the phone number itself, but also explicitly remove the phone number from the user. So in the controller, like this:
// Remove the phone number user connection from the database
$user->removePhoneNumber($phone_number);
// Remove the phone number from the database
$em->remove($phone_number);
I just thought the unique restriction which makes the relationship to a unidirectional One-To-Many would be enough to make doctrine take care of it. That was not correct.
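The database-level behaviour behind the error in the question can be reproduced outside Doctrine. The sketch below uses Python's sqlite3 instead of MySQL, with table names taken from the error message; the helper names `build_db` and `delete_phone_number` and the sample rows are made up. It shows the three situations discussed in this thread: the delete is blocked while a linker row exists, it succeeds once the link is removed first (the controller fix above), and it succeeds directly when the foreign key is declared with ON DELETE CASCADE.

```python
import sqlite3


def build_db(cascade):
    """Create phone_number plus its join table, optionally with ON DELETE CASCADE."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE phone_number (id INTEGER PRIMARY KEY, number TEXT)")
    on_delete = "ON DELETE CASCADE" if cascade else ""
    conn.execute(
        "CREATE TABLE user_phone_number_linker ("
        " user_id INTEGER NOT NULL,"
        " phone_number_id INTEGER NOT NULL"
        f" REFERENCES phone_number (id) {on_delete})"
    )
    conn.execute("INSERT INTO phone_number VALUES (1, '555-0100')")
    conn.execute("INSERT INTO user_phone_number_linker VALUES (10, 1)")
    return conn


def delete_phone_number(conn):
    """Try to delete phone number 1; return True if the database allowed it."""
    try:
        conn.execute("DELETE FROM phone_number WHERE id = 1")
        return True
    except sqlite3.IntegrityError:
        return False


plain = build_db(cascade=False)
blocked_first = delete_phone_number(plain)  # linker row still references it
plain.execute("DELETE FROM user_phone_number_linker WHERE phone_number_id = 1")
allowed_after_unlink = delete_phone_number(plain)

cascading = build_db(cascade=True)
allowed_with_cascade = delete_phone_number(cascading)
links_left = cascading.execute(
    "SELECT COUNT(*) FROM user_phone_number_linker"
).fetchone()[0]
```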

Can you move compound keys and/or foreign keys to other tables when normalizing to 3NF (third normal form)

My database design is currently at 3NF. The issue is foreign keys and in some cases compound keys.
Can you move compound keys and/or foreign keys to create other tables provided the attributes associated with the compound/foreign keys do not rely on the primary key?
I suspect the answer is yes due to this link:
Are Foreign Keys included in Third Normal Form?
Best Answer: Just because it's a foreign key doesn't mean it also can't be considered an attribute of the primary key. The fact that it's a foreign key to begin with implies it's defining a relationship with another table, and thus would not violate [...] 3NF.
-- TheMadProfessor
https://answers.yahoo.com/question/index?qid=20081117095121AAXWBbX#
This leads me to wonder whether my current normalization stage is 3NF.
Preamble
In pure relational database theory, there is nothing to stop you having composite primary keys (PKs), and you can have foreign keys (FKs) that reference them and those FKs are necessarily composite too. Some software has difficulty with composite keys, so you often find that people add an ID column which contains an automatically generated number, which is then designated as the PK of the table. Other tables can then have (simple) FKs that reference the (simple) ID column. One not uncommon mistake is to forget that the original composite PK is still a candidate key (CK), and its uniqueness should be enforced by the DBMS with a unique constraint on the table; it becomes an alternative key (AK).
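A short sketch of the "not uncommon mistake" and its remedy, using sqlite3. The `order_line` table is hypothetical: `id` is the surrogate PK, and the original composite key `(order_id, line_no)` is kept as an alternative key through a UNIQUE constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_line ("
    " id INTEGER PRIMARY KEY,"       # surrogate key, generated automatically
    " order_id INTEGER NOT NULL,"
    " line_no INTEGER NOT NULL,"
    " UNIQUE (order_id, line_no))"   # the original composite key, now an AK
)
conn.execute("INSERT INTO order_line (order_id, line_no) VALUES (7, 1)")

try:
    # Same composite key again: only the UNIQUE constraint can catch this,
    # because the surrogate id would happily take a fresh value.
    conn.execute("INSERT INTO order_line (order_id, line_no) VALUES (7, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Without the UNIQUE clause the second insert succeeds and the table silently holds two rows for the same order line.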
Diversion
The system of CKs, AKs and PKs works like this:
Every CK is a set of (one or more) columns that is a unique identifier for the data in the rest of each row of data in a table.
One CK may be designated as the PK.
The other CKs become AKs.
Consider this table:
CREATE TABLE elements
(
    atomic_number   INTEGER NOT NULL PRIMARY KEY
                    CHECK (atomic_number > 0 AND atomic_number < 120),
    symbol          CHAR(3) NOT NULL UNIQUE,
    name            CHAR(20) NOT NULL UNIQUE,
    atomic_weight   DECIMAL(8,4) NOT NULL,
    period          SMALLINT NOT NULL
                    CHECK (period BETWEEN 1 AND 7),
    -- GROUP is a reserved word in most SQL dialects, so the name is quoted
    "group"         CHAR(2) NOT NULL
                    -- 'L' for Lanthanoids, 'A' for Actinoids
                    CHECK ("group" IN ('1', '2', 'L', 'A', '3', '4', '5', '6',
                                       '7', '8', '9', '10', '11', '12', '13',
                                       '14', '15', '16', '17', '18')),
    stable          CHAR(1) DEFAULT 'Y' NOT NULL
                    CHECK (stable IN ('Y', 'N'))
);
Each of atomic_number, symbol and name is a candidate key. For chemistry, the symbol is most convenient as the primary key; for physics, the atomic_number is most convenient. The tables related to isotopes etc reference the atomic_number column, but the tables related to chemical compounds reference the symbol column. The three CKs here are all simple; on the other hand, the isotopes table has a compound PK consisting of the atomic number of the element (the number of protons) and the number of neutrons.
Answer
Getting back to your question, your data may well be in 3NF, or more likely BCNF (which is formally stronger than 3NF).
You'd have to show us your table schemas and specify the constraints (functional dependencies, etc) that apply to the columns before we could assess your design. But there is nothing that you've described which, a priori, prevents it from being well normalized.
I'm not sure that I understand your question. If you are asking, "Can you have a foreign key in a table without violating 3NF?" the answer is absolutely, positively yes. Nothing about any stage of normalization says that you should eliminate foreign keys. Indeed, it's pretty much impossible to normalize all but the most trivial data without using foreign keys.
** Update **
Okay, maybe now I understand your question, but then I think you've answered it for yourself. Yes, in a fully-normalized DB, you should not have non-key dependencies. If you have FKs that are not dependent on the PK, then they should be moved to another table.
To make a simple example, suppose you want to keep track of people, the city they live in, and the country that that city is in. So for your first draft, you make this structure: (Asterisk marks the PK.)
Person (person_id*, person_name, city_id, country_id)
City (city_id*, city_name)
Country (country_id*, country_name)
This is not normalized. A city is in the same country regardless of what resident of that city we are talking about. Paris is not in France when we are talking about Pierre but in Germany when we are talking about Francois. (If there are two cities with the same name, of course those are different cities and should have different records. I suppose a city could cross national boundaries, but for our purposes here let's assume that if it does, we consider it two cities that happen to touch. They would surely have different city governments, different postal systems, etc.) So we have a non-key dependency. country_id depends on city_id, not on person_id.
So to normalize this schema, we should move the country_id to a table where it is dependent solely on the PK. Presumably, the City table:
Person (person_id*, person_name, city_id)
City (city_id*, city_name, country_id)
Country (country_id*, country_name)
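To see that nothing is lost by moving country_id, here is a small Python sketch. The sample rows (including the ids and the third person) are invented, and the city and country name columns are left out for brevity. It projects the first-draft table into the two normalized shapes and joins them back.

```python
first_draft = [
    {"person_id": 1, "person_name": "Pierre", "city_id": 10, "country_id": 33},
    {"person_id": 2, "person_name": "Francois", "city_id": 10, "country_id": 33},
    {"person_id": 3, "person_name": "Hans", "city_id": 20, "country_id": 49},
]


def project(rows, columns):
    """Project rows onto the given columns, dropping duplicate subrows."""
    result = []
    for row in rows:
        subrow = {column: row[column] for column in columns}
        if subrow not in result:
            result.append(subrow)
    return result


person = project(first_draft, ["person_id", "person_name", "city_id"])
city = project(first_draft, ["city_id", "country_id"])

# Natural join of the two projections on city_id.
rejoined = [
    {**p, **c}
    for p in person
    for c in city
    if p["city_id"] == c["city_id"]
]
print(len(city), rejoined == first_draft)
```

The country of city 10 is now stored once instead of once per resident, and the join still reproduces every original row.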
Do you understand what a FD (functional dependency) is? A FD is an expression with an arrow between two attribute sets. Given a table and a FD, saying that the FD holds in the table or the table has the FD or the table satisfies the FD says all subrows for the first attribute set appear with the same subrow for the second attribute set.
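As a concrete reading of that definition, the following sketch checks whether an FD holds in a table given as a list of rows. The function name `fd_holds` and the sample rows are assumptions. Note that a check like this can only refute an FD against the current data; whether the FD is a constraint on all valid states is a design decision.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every lhs subrow appears with exactly one rhs subrow."""
    seen = {}
    for row in rows:
        determinant = tuple(row[attribute] for attribute in lhs)
        dependent = tuple(row[attribute] for attribute in rhs)
        # setdefault records the first rhs subrow seen for this lhs subrow.
        if seen.setdefault(determinant, dependent) != dependent:
            return False
    return True


rows = [
    {"person_id": 1, "city_id": 10, "country_id": 33},
    {"person_id": 2, "city_id": 10, "country_id": 33},
    {"person_id": 3, "city_id": 20, "country_id": 49},
]
print(fd_holds(rows, ["city_id"], ["country_id"]))
print(fd_holds(rows, ["country_id"], ["person_id"]))
```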
Do you understand that normalization involves CKs (candidate keys)? Independent of normalization we can call one CK a PK (primary key) and the others AKs (alternate keys).
Do you understand what a FK (foreign key) is? Given a database, a table and an attribute list, saying that the list is a FK referencing some attribute list in some table says that every list of values for the attributes in the first table is also a list of values for the attributes in the second table where those attributes form a CK (candidate key).
Normalization uses FDs & JDs (join dependencies) to replace a table by projections of it that join back to it. FKs (foreign keys) are irrelevant to normalization. That's both to whether a table is in a given NF (normal form) and to decomposing to a given NF. The answer you link to is also saying that FKs are irrelevant to whether a table is in 3NF--first for a specific case, then for the general case.
Because components share column values from an original table, some FKs will arise from normalization. Just as various tables and their PKs, AKs, CKs, superkeys, FDs & JDs will. That is normalization output, not input.
Since normalization replaces tables by projections of them, column sets in common among the original & components must contain the same subrow values. That's an EQD (equality dependency). Subrow values that form CKs will thus have FKs to them arise from normalization.
But often during normalization we see that we want to replace a table by some that are projections with others that have at least the subrows of a projection. Common subrows in a projection must then appear in expanded components, but not vice versa. That's an IND (inclusion dependency). There will still be a FK when common columns form a CK in a table that is a superset of a projection. Such a design change isn't normalization, it is just a change that you noticed during normalization from a design that was wrong in not allowing all the possible business situations to be recorded to a design that doesn't have that problem.
"Compound" vs "simple" vs "empty" apply to FKs, PKs, AKs, CKs (candidate keys), & superkeys and are relevant to FD & JD parts. Definitions will specify compound/simple/empty when it matters. (Eg: Sometimes FDs are put into canonical forms involving single columns. Sometimes we can easily infer NFs hold partly based on whether CKs are simple.)
After you get your tables, declare (sufficient) constraints. Then the DBMS can enforce them. FKs, PKs, AKs, CKs, superkeys, FDs & JDs all have associated constraints. SQL lets you declare all PKs & AKs (via PRIMARY KEY & UNIQUE NOT NULL). Those declarations actually declare superkeys which happen to be CKs/PKs/AKs when no smaller ones are declared within them. Similarly SQL FOREIGN KEY declares a foreign superkey that is a FK if it is actually to a CK. Declare sufficient chains of FKs to enforce the ones you don't declare. (Via transitivity.) SQL DBMSs typically won't let you declare FK cycles. SQL also makes you declare superkeys referenced by FK declarations whether or not those columns contain a smaller declared superkey/CK and so must be a superkey. Declare or enforce via triggers any FD or JD constraints that aren't implied by CK constraints. (5NF gets rid of all such constraints except some cycles of FDs on CKs.)
Find academic textbook definitions and algorithms.

Django: Query with F() into an object not behaving as expected

I am trying to navigate into the Price model to compare prices, but met with an unexpected result.
My model:
class ProfitableBooks(models.Model):
    price = models.ForeignKey('Price', primary_key=True)
In my view:
foo = ProfitableBooks.objects.filter(price__buy__gte=F('price__sell'))
Producing this error:
'ProfitableBooks' object has no attribute 'sell'
Is this your actual model or a simplification? I think the problem may lie in having a model whose only field is a foreign key that is also its primary key. If I try to parse that out, it seems to imply that it's essentially a field acting as a proxy for a queryset: you could never have more profitable books than prices because of the nature of primary keys. It also would seem to mean that your elided books field must have no overlap in prices due to the implied uniqueness constraints.
If I understand correctly, you're trying to compare two values in another model: price.buy vs. price.sell, and you want to know if this unpictured Book model is profitable or not. While I'm not sure exactly how the F() object breaks down here, my intuition is that F() is intended to facilitate a kind of efficient querying and updating where you're comparing or adjusting a model value based on another value in the database. It may not be equipped to deal with a 'shell' model like this which has no fields except a joint primary/foreign key and a comparison of two values both external to the model from which the query is conducted (and also distinct from the Book model which has the identifying info about books, I presume).
The documentation says you can use a join in an F() object as long as you are filtering and not updating, and I assume your price model has a buy and sell field, so it seems to qualify. So I'm not 100% sure where this breaks down behind the scenes. But from a practical perspective, if you want to accomplish exactly the result implied here, you could just do a simple query on your price model, because again, there's no distinct data in the ProfitableBooks model (it only returns prices), and you're also implying that each price.buy and price.sell have exactly one corresponding book. So Price.objects.filter(buy__gte=F('sell')) gives the result you've requested in your snippet.
If you want to get results which are book objects, you should do a query like the one you've got here, but start from your Book model instead. You could put that query in a queryset manager called "profitable_books" or something, if you wanted to substantiate it in some way.
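As a rough, database-level picture of what a filter like Price.objects.filter(buy__gte=F('sell')) asks for, namely a comparison between two columns of the same row, here is a sketch in plain sqlite3 rather than Django. The `price` table layout and the sample rows are assumptions, and the SQL is only an approximation of what the ORM would generate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE price ("
    " id INTEGER PRIMARY KEY,"
    " buy INTEGER NOT NULL,"
    " sell INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO price VALUES (?, ?, ?)",
    [(1, 500, 300), (2, 200, 450), (3, 400, 400)],
)

# Column-to-column comparison, evaluated by the database for each row.
matching_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM price WHERE buy >= sell ORDER BY id")
]
print(matching_ids)
```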