Better to use foreign key or to assign unique ids? - foreign-keys

A simplified model of the database: say I have a table A with columns a, b, c, d, where (a, b, c, d) is the primary key. Then I have another table B to store some list-like data for each entry in A, in order to stay in first normal form.
Table B therefore has columns a, b, c, d, e, where each e entry is one element of the list. It is natural to put a foreign key constraint on (a, b, c, d) in B, which enforces integrity: everything must exist in A first, then in B.
But I wonder: does the foreign key constraint let the database engine compress, or avoid duplicating, the data stored in B? (In other words, will (a, b, c, d) be stored again verbatim, identical to what is in A?) If not, would assigning each entry in A a unique ID be a better choice here?

Most SQL-based database engines do require foreign key values to be physically stored at least twice (in the referencing table and in the parent table). It would be nice to have the option not to do that in the case of large foreign keys. Many database designers will choose to avoid large foreign keys, partly because they have this additional overhead.
Most DBMSs do provide the option to compress data - foreign key or not. In many cases that will probably more than compensate for the physical duplication of data due to a foreign key.
Foreign keys are a logical construct however, and in database design it's important to distinguish between logical and physical concerns.

Table storage: Each MySQL table is stored completely separately. In some cases, two tables may live in the same OS file, but the blocks (16KB for InnoDB) will be totally separate. Therefore, (a,b,c,d) shows up in at least 2 places in the dataset -- once in A and once in B.
A FOREIGN KEY has the side effect of creating an extra INDEX if there is not one already there. (In your case, you said it was the PK, so it is already indexed.) Note that an FK does not need a UNIQUE index. (In your case, the PK is unique, but that seems irrelevant.)
A secondary index (as opposed to the PRIMARY KEY) for a table is stored in a separate BTree, ordered by the key column(s). So, if (a,b,c,d) had not already been indexed, the FK would lead to an extra copy of (a,b,c,d), namely in the secondary index.
There is one form of compression in InnoDB: you can declare a table to be ROW_FORMAT=COMPRESSED. But this has nothing to do with de-duplicating (a,b,c,d).
Four columns is a lot for a PK, but it is OK. If it is 4 SMALLINT values, then it is only 8 bytes (plus overhead) per row per copy of the PK. If it is a bunch of VARCHARs, then it could be much bulkier.
When should you deliberately add a surrogate id as the PK? In my experience, only about one-third of the cases. (Others will argue.) If you don't have any secondary keys, nor FKs referencing it, then the surrogate is a waste of space and speed. If you have only one secondary key or FK, then the required space is about the same. This last situation is what you described so far.
Table size: If you have a thousand rows, space is not likely to be an issue. A million rows might trigger thinking more seriously about space. For a billion rows, 'pull out all stops'.
PK tips: Don't include DATETIME or TIMESTAMP; someday there will need to be two rows with the same second. Don't put more columns in the PK than are needed for the implicit uniqueness constraint; if you do, you effectively lose that constraint. (There are exceptions.)

Related

AppSync $util.autoId() and DynamoDB Partition and Sort Keys Design Questions

The limits for partition and sort keys of DynamoDB are such that if I want to create a table with lots of users (e.g. the entire world population), I can't just use a unique partition key to represent the personId; I need to use both a partition key and a sort key to represent a personId.
$util.autoId() in AppSync returns a 128-bit String. If I want to use this as the primary key in the DynamoDB table, then I need to split it into two Strings, one being the partition key and the other the sort key.
What is the best way to perform this split? Or if this is not the best way to approach the design, how should I design it instead?
Also, do the limits on partition and sort keys apply to secondary indexes as well?
Regarding $util.autoId(), since it's generated randomly, if I call it many times, is there a chance that it will generate two id's that are exactly the same?
I think I'm misunderstanding something in your question's premise, because to my brain, using AppSync's $util.autoId() gives you back a 128-bit UUID. The point of UUIDs is that they're unique, so you can absolutely have one UUID per person in the world. And the UUID string will definitely fit within the maximum character length limits of DynamoDB's partition key.
You also asked:
if I call it many times, is there a chance that it will generate two
id's that are exactly the same?
It's extremely unlikely.

Easiest primary key for main table?

My main table, Users, stores information about users. I plan to have a UserId field as the primary key of the table. I have full control of creation and assignment of these keys, and I want to ensure that I assign keys in a way that provides good performance. What should I do?
You have a few options:
The most generic solution is to use UUIDs, as specified in RFC 4122.
For example, you could have a STRING(36) that stores UUIDs. Or you could store the UUID as a pair of INT64s or as a BYTE(16). There are some pitfalls to using UUIDs, so read the details of this answer.
If you want to save a bit of space and are absolutely sure that you will have fewer than a few billion users, then you could use an INT64 and assign UserIds using a random number generator. The reason you want to be sure you have fewer than a few billion users is the Birthday Problem: the odds of at least one collision reach roughly 50% once you have on the order of 4B users, and they increase very fast from there. If you assign a UserId that has already been assigned to a previous user, your insertion transaction will fail, so you'll need to be prepared for that (by retrying the transaction with a new random number).
If there's some column, MyColumn, in the Users table that you would like to have as primary key (possibly because you know you'll want to look up entries using this column frequently), but you're not sure about the tendency of this column to cause hotspots (say, because it's generated sequentially or based on timestamps), then you have two other options:
3a) You could "encrypt" MyColumn and use this as your primary key. In mathematical terms, you could use a bijection on the key values, which has the effect of chaotically scrambling them while still never assigning the same value twice. In this case, you wouldn't need to store MyColumn separately at all, but rather you would only store/use the encrypted version and could decrypt it when necessary in your application code. Note that this encryption doesn't need to be secure; it just needs to guarantee that the bits of the original value are sufficiently scrambled in a reversible way. For example: if your values of MyColumn are integers assigned sequentially, you could just reverse the bits of MyColumn to create a sufficiently scrambled primary key. If you have a more interesting use-case, you could use an encryption algorithm like XTEA.
3b) Have a compound primary key where the first part is a ShardId, computed as hash(MyColumn) % numShards, and the second part is MyColumn. The hash function will ensure that you don't create a hot-spot by allocating all your rows to a single split. More information on this approach can be found here. Note that you do not need to use a cryptographic hash, although md5 or sha512 are fine functions. SpookyHash is a good option too. Picking the right number of shards is an interesting question and can depend upon the number of nodes in your instance; it's effectively a trade-off between hotspot-avoiding power (more shards) and read/scan efficiency (fewer shards). If you only have 3 nodes, then 8 shards is probably fine. If you have 100 nodes, then 32 shards is a reasonable value to try.

What's the correct normalization of a relationship of three tables?

I have three tables:
teachers
classes
courses
The requirements are:
A teacher may teach one or more courses.
A teacher may teach one or more classes.
A teacher teaches a course for a class.
So I need a fourth table with PRIMARY KEY of each of the three tables composing the PRIMARY KEY or UNIQUE INDEX of the fourth table.
What is the correct normalization for this?
The name of the table: is "class_course_teacher" ok, or should I use a name like "assignments"?
The primary key of the table: is "class_id + course_id + teacher_id" ok, or should I create an "id" for the assignments table and make "class_id + course_id + teacher_id" a unique index?
Normalization starts with functional dependencies. It's a method of breaking a set of information into elementary facts without losing information. Think of it as logical refactoring.
Your sentences are a start but not sufficient to determine the dependencies. Do courses have one or more teachers? Do classes have one or more teachers? Do courses have one or more classes? Do classes belong to one or more courses? Can teachers "teach" courses without classes (i.e. do you want to record which teachers can teach a course before any classes are assigned)? Do you want to define classes or courses before assigning teachers?
Your two questions don't relate to normalization. assignments is a decent name, provided you won't be recording other assignments (like homework assignments), in which case teacher_assignments or class_assignments might be better. class_course_teacher could imply that there can only be one relationship involving those three entity sets, but it can and does happen that different relationships involve the same entity sets.
I advise against using surrogate ids until you have a reason to use them. They increase the number of columns and indices required without adding useful information, and can increase the number of tables that need to be joined in a query (since you need to "dereference" the assignment_id to get to the class_id, course_id and teacher_id) unless you record redundant data (which has its own problems).
Normalization is about data structures, not about naming. If the only requirement is "correct normalization", then both decisions are up to you.
Nevertheless, good names are important in the real world. I like "assignments" - it is very meaningful.
I like to create ID columns for each table; they make it easier to change relationships between tables later.

normalize or not?

I have a DB in which there are 4 tables.
A -> B -> C -> D
Currently the way I have it is: the primary key of A is a foreign key in B. B has its own primary key, which is a foreign key in C, and so on.
However, C can't be linked to A without B.
The problem is, a core function of my program involves pulling matching entries from A and D.
Should I include the primary key of A in D too?
Doing so would create unnecessary data duplication, because A -> B -> C -> D is a hierarchy.
See the pic for what D would look like.
If you need all D's related to a given A, I would keep it normalized.
But if you want a specific subset of such D's, and it's easy to know which ones in advance but time-consuming to find later (e.g. if you want all D's from the newest C of the newest B), I would prefer storing this shortcut somewhere.
It does not have to be in D itself (especially if you don't want all D's connected with A).
If you want to do it to make your queries easier to read and write, consider a view.
If you want to do it to increase performance, try everything and measure. (I'm not an expert in performance tuning of SQL, so I have no specific advice beyond that.)

What is the best way to store a relation in main memory?

I am working on an application which is a mini DBMS design for evaluating SPJ queries. The program is being implemented in C++.
When I have to process a query for joins and group-by, I need to maintain a set of records in the main memory. Thus, I have to maintain temporary tables in main memory for executing the queries entered by the user.
My question is, what is the best way to achieve this in C++? What data structure do I need to make use of in order to achieve this?
In my application, I am storing data in binary files and using the Catalog (which contains the schema for all the existing tables), I need to retrieve data and process them.
I have only 2 datatypes in my application: int (4 Bytes) and char (1 Byte)
I can use std::vector. In fact, I tried a vector of vectors, with the inner vector storing the attributes, but the problem is that many relations can exist in the database, and each may have any number of attributes. Also, each attribute can be either an int or a char. So I am unable to identify the best way to achieve this.
Edit
I cannot use a struct for the tables because I do not know how many columns exist in the newly added tables, since all tables are created at runtime as per the user query. So, a table schema cannot be stored in a struct.
A Relation is a Set of Tuples (and in SQL, a Table is a Bag of Rows). Both in Relational Theory and in SQL, all tuples (/rows) in a relation (/table) "comply to the heading".
So it makes sense for an object that stores a relation (/table) to consist of two components: an object of type "Heading" and a Set (/Bag) object containing the actual tuples (/rows).
The "Heading" object is itself a Mapping of attribute (/column) names to "declared data types". I don't know C++, but in Java it might be something like Map<AttributeName,TypeName> or Map<AttributeName,Type> or even Map<String,String> (provided you can use those Strings to go get the actual 'Type' objects from wherever they reside).
The set of tuples (/rows) consists of members that are all a Mapping of attribute (/column) names to attribute values, which are either int or String in your case. The biggest problem here is that this suggests you need something like Map<AttributeName,Object>, but you might get into trouble over your ints not being objects.
As a generic container for any table rows, I'd most likely use std::vector (as pointed out by Iarsmans). As for the table columns, I'd most likely define those with structs representing the table schema. For example:

#include <vector>

struct DataRow
{
    int col1;
    char col2;
};

typedef std::vector<DataRow> DataTable;

DataTable t;
DataRow dr;
dr.col1 = 1;
dr.col2 = 'a';
t.push_back(dr);