What's the correct normalization of a relationship between three tables? - foreign-keys

I have three tables:
teachers
classes
courses
The requirements are:
A teacher may teach one or more courses.
A teacher may teach one or more classes.
A teacher teaches a course for a class.
So I need a fourth table whose PRIMARY KEY or UNIQUE INDEX is composed of the primary keys of the three tables.
What is the correct normalization for this?
The name of the table: is "class_course_teacher" okay, or should I use a name like "assignments"?
The primary key of the table: is "class_id + course_id + teacher_id" okay, or should I create an "id" column for the assignments table and put "class_id + course_id + teacher_id" in a unique index?

Normalization starts with functional dependencies. It's a method of breaking a set of information into elementary facts without losing information. Think of it as logical refactoring.
Your sentences are a start but not sufficient to determine the dependencies. Do courses have one or more teachers? Do classes have one or more teachers? Do courses have one or more classes? Do classes belong to one or more courses? Can teachers "teach" courses without classes (i.e. do you want to record which teachers can teach a course before any classes are assigned)? Do you want to define classes or courses before assigning teachers?
Your two questions don't relate to normalization. assignments is a decent name, provided you won't be recording other assignments (like homework assignments), in which case teacher_assignments or class_assignments might be better. class_course_teacher could imply that there can only be one relationship involving those three entity sets, but it can and does happen that different relationships involve the same entity sets.
I advise against using surrogate ids until you have a reason to use them. They increase the number of columns and indices required without adding useful information, and can increase the number of tables that need to be joined in a query (since you need to "dereference" the assignment_id to get to the class_id, course_id and teacher_id) unless you record redundant data (which has its own problems).
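Under the interpretation above (no surrogate id; the three foreign keys form the natural key), the fourth table might be sketched like this. The table and column names are illustrative, and the sketch assumes each base table has an integer "id" primary key:

```sql
-- Hypothetical sketch: a junction table keyed by the three foreign keys.
CREATE TABLE assignments (
    teacher_id INT NOT NULL REFERENCES teachers (id),
    course_id  INT NOT NULL REFERENCES courses (id),
    class_id   INT NOT NULL REFERENCES classes (id),
    PRIMARY KEY (teacher_id, course_id, class_id)
);
```

Whether this composite key is actually correct depends on the answers to the dependency questions above; for example, if a class has exactly one teacher per course, the key would shrink to (class_id, course_id).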

Normalization is about data structures - not about naming. If the only requirement is "correct normalization", both decisions are up to you.
Nevertheless good names are important in the real world. I like "assignments" - it is very meaningful.
I like to create ID columns for each table; they make it easier to change relationships between tables later.

Related

Partition key consisting of multiple values in DynamoDB standard attribute name and value formatting?

I am creating a DDB table whose partition key and sort key are each made up of multiple values. The primary key is a composite of the partition key and sort key.
The partition key would be something like region+date+location and the sort key would be zone+update timestamp millis.
What's the norm for naming these attributes? Is it just naming out the values like region+date+location ? Or some other kind of delimitation? I've also read that it might be better to be generic and just name it something like partitionKey and rangeKey or <typeofthing>id etc. but I've gotten a little pushback on this from my team that the names aren't helpful in that case.
I can't seem to find best practices for this specific question anywhere? Is there a preferred approach for this written down somewhere that I could point to?
There is no "standard" way of naming attributes. But there are two things to consider against mandating a naming standard like region+date+location:
A very long attribute name is wasteful - you need to send it over the network when writing and reading items, and it is included in the item length for which you pay per operation and for storage. I'm not saying it means you should name your attributes "a" and "b", but try not to go overboard on the other direction either.
An attribute name region+date+location implies that this attribute contains only this combination, and will forever contain only this combination. But often in DynamoDB the same attribute name is reused for multiple different types - this reuse is the hallmark of "single table design". That being said, these counterexamples aren't too relevant to your use case, because the attributes overloaded in this way are usually not the key attributes, which they are in your use case.
In your case I think that whatever you decide will be fine. There is no compelling reason to stay away from one of the options you mentioned.

Should I use UUID everywhere or is it alright to mix with usual IDs?

I've chosen to use UUIDs as primary key for all of my tables in DB. But I still have Django default tables like Groups, Permissions, Django_admin_log, etc. Should I override them to make the pk UUID or should I leave it like that?
Leaving it usual integer is simpler of course, but I feel like mixing usual ids and UUIDs in the database is at least confusing. I don't have explicit need to override them to use UUID, but still I can't come up with the conclusion.
UUIDs have more storage overhead than serial integers. For this reason I would recommend using them only where there is a real benefit. You didn't specify which database you are working with, but it is possible you could end up bumping other columns to extended storage (and therefore forcing implicit joins).
If your tables are fairly narrow and you don't have to query on id ranges, then it makes little difference.

How do I name my model if the table describes a journal (accounting)?

A model name is usually the singular name of some entity, and the table name is the plural form of that word.
For example, a Transaction is stored in the transactions table.
But there are cases where a whole table is described by a singular word that denotes a scope of entities, for example: journal, log, history.
And there is no more precise name for a single row besides "entry" or "item". But a model named ThingsJournalEntry looks messy, and a simple ThingsJournal is confusing, because an instance doesn't describe an actual journal but a single entry.
Is there a common naming approach for such cases better than described above?
Your question shows that there are actually two naming issues at play. One is regarding the elements of your collection, which you ask about explicitly. The other is regarding the collection itself, which is rather implicit. In fact, when you refer to Journal you feel compelled to clarify (accounting). This means that your collection class would be better named AccountingJournal, which would remove the ambiguity.
Now, since the description you provide of these objects (collection and elements) is a little bit succinct, I don't have enough information as to suggest an appropriate name. However, in order to give you some hints I would recommend considering not just the nature of the elements, but the responsibility they will have in your model. Will they represent entities or actions? If the elements are entities (things) consider names that denote simple nouns that would replicate or be familiar with the language used by accountants. Examples include AccountingEntry or AccountingRecord. If your elements represent actions then use a suffix that stresses such characteristic, for example, AccountingAnnotation or AccountingRegistration. Another question you can ask yourself is what kind of messages these elements will receive? For instance, if they will represent additions and subtractions of money you might want to use AccountingOperation or AccountChange.
Whatever the situation dictated by the domain you should verify that your naming conventions sound as sentences said by actual domain experts: "[this is an] accounting record of [something]" or "add this accounting record to the journal" or "register this accounting operation in the journal" or even "this is an accounting operation for subtracting this amount of money," etc.
The (intellectual) activity of naming your objects is directly connected with the activity of shaping your model. Exercise your prospective model in your head by saying aloud the messages that they would typically receive, and check that the language you are shaping closely reproduces the language of the domain. In other words, let your objects talk and hear what they say; they will tell you their names.

Sentiment analysis to find the top 3 adjectives for products in tweets

There is a sentiment analysis tool to find out people's perceptions on social networks.
This tool can:
(1) Decompose a document into a set of sentences.
(2) Decompose each sentence into a set of words, and perform filtering such that only product names and adjectives are preserved.
e.g. "This MacBook is awesome. Sony is better than Macbook."
After processing, we can get:
{MacBook, awesome}
{Sony, better}. (not the truth :D)
We just assume there exists a list of product names, P, that we will ever care about, and a list of adjectives, A, that we will ever care about.
My questions are:
Can we reduce this problem to a specialized association rule mining problem, and how? If yes, what needs attention, e.g. the reduction itself, parameter settings (minsup and minconf), additional constraints, and modifications to the Apriori algorithm?
Is there any way to artificially spam the result, like pushing "horrible" to the top adjective slot? And are there good ways to prevent such spam?
Thanks.
Have you considered counting?
For every product, count how often each adjective occurs.
Report the top-3 adjectives for each product.
Takes just one pass over your data, and does not use a lot of memory (unless you have millions of products to track).
There is no reason to use association rule mining. Association rule mining only pays off when you are looking for large itemsets (i.e. 4 or more terms) and they are equally important. If you know that one term is special (e.g. product name vs. adjectives), it makes sense to split the data set by this unique key, and then use counting.
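The counting approach can be sketched in a few lines of C++. This is an illustrative sketch, not part of any particular tool; it assumes the (product, adjective) pairs have already been extracted, and the function name is made up:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Count how often each adjective co-occurs with the given product,
// then return the top-k adjectives by count. One pass to count,
// one small sort over the distinct adjectives for that product.
std::vector<std::string> topAdjectives(
    const std::vector<std::pair<std::string, std::string>>& pairs,
    const std::string& product, std::size_t k = 3) {
    std::map<std::string, int> counts;
    for (const auto& [p, adj] : pairs)
        if (p == product) ++counts[adj];

    std::vector<std::pair<std::string, int>> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });

    std::vector<std::string> top;
    for (std::size_t i = 0; i < ranked.size() && i < k; ++i)
        top.push_back(ranked[i].first);
    return top;
}
```

Splitting by product first, as the answer suggests, keeps the per-product count maps small even when the corpus is large.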

Data structures to implement unknown table schema in c/c++?

Our task is to read information about table schema from a file, implement that table in c/c++ and then successfully run some "select" queries on it. The table schema file may have contents like this,
Tablename- Student
"ID","int(11)","NO","PRIMARY","0","".
Now, my question is: what data structures would be appropriate for the task? The problem is that I do not know the number of columns a table might have, nor what the names of those columns might be, nor anything about their data types. For example, a table might have just one column of type int; another might have 15 columns of varying data types. In fact, I don't even know the number of tables whose descriptions the schema file might contain.
One way I thought of was to have a set number of say, 20 vectors (assuming that the upper limit of the columns in a table is 20), name those vectors 1stvector, 2ndvector and so on, map the name of the columns to the vectors, and then use them accordingly. But it seems the code for it would be a mess with all those if/else statements or switch case statements (for the mapping).
While googling/stack-overflowing, I learned that you can't describe a class at runtime otherwise the problem might have been easier to solve.
Any help is appreciated.
Thanks.
As a C++ data structure, you could try a std::vector< std::vector<boost::any> >. A vector is part of the Standard Library and allows dynamic resizing of the number of elements. A vector of vectors would imply an arbitrary number of rows with an arbitrary number of columns. Boost.Any is not part of the Standard Library but is widely available and allows storing arbitrary types.
I am not aware of any good C++ library to do SQL queries on that data structure. You might need to write your own. E.g. the SQL commands select and where would correspond to the STL algorithm std::find_if with an appropriate predicate passed as a function object.
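A minimal sketch of that idea, using std::any (the C++17 standard equivalent of boost::any). The type aliases and the selectWhere name are illustrative, not an existing library API:

```cpp
#include <any>
#include <string>
#include <vector>

using Row = std::vector<std::any>;   // one row: cells of arbitrary type
using Table = std::vector<Row>;      // arbitrary number of rows

// A tiny "select ... where": return the rows for which the predicate
// holds. Column types are only known at the point where the caller
// any_casts a cell inside the predicate.
template <typename Pred>
Table selectWhere(const Table& table, Pred pred) {
    Table result;
    for (const auto& row : table)
        if (pred(row)) result.push_back(row);
    return result;
}
```

A caller would filter with a predicate that casts the relevant column, e.g. `std::any_cast<int>(row[0]) > 5`, which is essentially the std::find_if-style approach the answer describes.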
To deal with the lack of knowledge about the data column types you almost have to store the raw input (i.e. strings which suggests std:string) and coerce the interpretation as needed later on.
This also has the advantage that the column names can be stored in the same type.
If you really want to determine the column type, you'll need to speculatively parse each column of input to see what it could be and make decisions on that basis.
Either way, if the input could contain a column that has the column-separation symbol in it (say, a string including a space in otherwise whitespace-separated data), you will have to know the quoting convention of the input and write a parser of some kind to work on the data (sucking whole lines in with getline is your friend here). Your input appears to be comma-separated with double-quote-delimited strings.
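A sketch of such a line parser, assuming comma-separated fields where a field may be wrapped in double quotes (the quoting rules here are an assumption about the input format, e.g. embedded quotes are not handled):

```cpp
#include <string>
#include <vector>

// Split one input line into raw string fields. Commas separate fields;
// a double-quoted field may itself contain commas. Quotes are dropped.
std::vector<std::string> splitLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    bool inQuotes = false;
    for (char c : line) {
        if (c == '"') {
            inQuotes = !inQuotes;     // toggle quoted state, drop the quote
        } else if (c == ',' && !inQuotes) {
            fields.push_back(field);  // field boundary
            field.clear();
        } else {
            field += c;
        }
    }
    fields.push_back(field);          // last field has no trailing comma
    return fields;
}
```

Storing the resulting fields as std::string, as suggested above, defers all type interpretation until a query actually needs it.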
I suggest using std::vector to hold all the table creation statements. After all the creation statements are read in, you can construct your table.
The problem to overcome is the plethora of column types. All the C++ containers like to have a uniform type, such as std::vector<std::string>. You will have different column types.
One solution is to have your data types descend from a single base. That would allow you to have std::vector<Base *> for each row of the table, where the pointers can point to fields of different {child} types.
I'll leave the rest up to the OP to figure out.
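One way to sketch that single-base idea (the class names here are illustrative, not prescribed):

```cpp
#include <memory>
#include <string>
#include <vector>

// Common base for all field types, so a row can hold pointers to
// fields of different concrete types.
struct Field {
    virtual ~Field() = default;
    virtual std::string toString() const = 0;
};

struct IntField : Field {
    int value;
    explicit IntField(int v) : value(v) {}
    std::string toString() const override { return std::to_string(value); }
};

struct TextField : Field {
    std::string value;
    explicit TextField(std::string v) : value(std::move(v)) {}
    std::string toString() const override { return value; }
};

// One table row: a heterogeneous sequence of owned fields.
using Row = std::vector<std::unique_ptr<Field>>;
```

Each column type read from the schema file would map to one concrete Field subclass; queries then work against the common interface (or downcast where a typed comparison is needed).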