normalize or not? - foreign-keys

I have a DB in which there are 4 tables.
A -> B -> C -> D
Currently the way I have it is: the Primary Key of A is a foreign key in B, and B has its own Primary Key, which is a foreign key in C, and so on.
However, C can't be linked to A without B.
The problem is that a core function of my program involves pulling matching entries from A and D.
Should I include the primary key of A in D too?
Doing so will create unnecessary data duplication, because A -> B -> C -> D form a hierarchy.
See the picture for what D would look like.

If you need all Ds related to a given A, I would keep it normalized.
But if you want a specific subset of such Ds, and it's easy to know which ones in advance but time-consuming to determine later (e.g. if you want all Ds from the newest C from the newest B), I would prefer storing this shortcut somewhere.
It does not have to be in D itself (especially if you don't want all Ds connected with A).
If you want to do it to make your queries easier to read and write, then consider a view.
If you want to do it to increase performance, try everything and measure it. (I'm not an expert in performance tuning of SQL, so I have no specific advice beyond that.)

Related

Better to use foreign key or to assign unique ids?

A simplified model of the database is that, say, I have a table A, which has columns a, b, c, d (so that (a, b, c, d) is the primary key). Then I have another table B to store some list-like data for each entry in A, in order to stay in first normal form.
This B table will therefore have columns a, b, c, d, e, where each e entry is one element of the list. It is natural to have a foreign key constraint on (a, b, c, d) in B, which enforces that everything must exist in A first, then in B.
But I wonder whether the foreign key constraint lets the database engine compress or avoid duplicating the data stored in B. (In other words, will (a, b, c, d) be stored again, verbatim and identical to what is in A?) If not, would assigning each entry in A a unique ID be a better choice in this case?
Most SQL-based database engines do require foreign key values to be physically stored at least twice (in the referencing table and in the parent table). It would be nice to have the option not to do that in the case of large foreign keys. Many database designers will choose to avoid large foreign keys, partly because they have this additional overhead.
Most DBMSs do provide the option to compress data - foreign key or not. In many cases that will probably more than compensate for the physical duplication of data due to a foreign key.
Foreign keys are a logical construct, however, and in database design it's important to distinguish between logical and physical concerns.
Table Storage: Each MySQL table is stored completely separately. In some cases, two tables may live in the same OS file, but the blocks (16KB for InnoDB) will be totally separate. Therefore, (a,b,c,d) shows up in at least 2 places in the dataset -- once in A and once in B.
A FOREIGN KEY has the side effect of creating an extra INDEX if there is not one already there. (In your case, you said it was the PK, so it is already an index.) Note that an FK does not need a UNIQUE index. (In your case, the PK is unique, but that seems irrelevant.)
A secondary index (as opposed to the PRIMARY KEY) for a table is stored in a separate BTree, ordered by the key column(s). So, if (a,b,c,d) had not already been indexed, the FK would lead to an extra copy of (a,b,c,d), namely in the secondary index.
There is one form of compression in InnoDB: you can declare a table to be ROW_FORMAT=COMPRESSED. But this has nothing to do with de-duplicating (a,b,c,d).
Four columns is a lot for a PK, but it is OK. If it is 4 SMALLINT values, then it is only 8 bytes (plus overhead) per row per copy of the PK. If it is a bunch of VARCHARs, then it could be much bulkier.
When should you deliberately add a surrogate id as the PK? In my experience, only about one-third of the cases. (Others will argue.) If you don't have any secondary keys, nor FKs referencing it, then the surrogate is a waste of space and speed. If you have only one secondary key or FK, then the required space is about the same. This last situation is what you described so far.
Table size: If you have a thousand rows, space is not likely to be an issue. A million rows might trigger thinking more seriously about space. For a billion rows, pull out all the stops.
PK tips: Don't include DATETIME or TIMESTAMP; someday there will need to be two rows with the same second. Don't put more columns in the PK than are needed for the implicit uniqueness constraint; if you do, you effectively lose that constraint. (There are exceptions.)

Perfect hash function generator for functions

I have a set of C++ functions. I want to map these functions in a hash table, something like: unordered_map<function<ReturnType (Args...)>, SomethingElse>, where SomethingElse is not relevant for this question.
This set of functions is known in advance, small (say, fewer than 50), and static (it is not going to change).
Since lookup performance is crucial (it should be O(1)), I want to define a perfect hash function.
Does a perfect hash function generator exist for this scenario?
I know that perfect hash function generators exist (like GPERF or CMPH), but since I've never used them, I don't know if they're suitable for my case.
REASON:
I'm trying to design a framework where, given a program written in C++, the user can select a subset F of the functions defined in this program.
For each f belonging to F, the framework implements a memoization strategy: when we call f with input i, we store (i, o) inside some data structure. So, if we call f with i AGAIN, we return o without performing the (time-expensive) computation again.
The "already computed results" will be shared among different users (maybe in the cloud), so if user u1 has already computed o, user u2 will save computing time when calling f with i (using the same notation as before).
Obviously, we need to store the set of pairs (f, inputs_sets) (where inputs_sets is the set of already computed results that I talked about before), which is the original question: how do I do it?
So, using the "enumeration trick" proposed in the comments could be a solution in this scenario, assuming that all the users use exactly the same enumeration, which could be a problem: supposing that our program has f1, f2, f3, what if u1 wants to memoize only f1 and f2 (so F = {f1, f2}), while u2 wants to memoize only f3 (so F = {f3})? An overkill solution would be to enumerate all the functions defined in the program, but this could generate a huge waste of memory.
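(For concreteness, here is a rough sketch of what the enumeration trick could look like, assuming each memoizable function gets a stable integer ID that all users agree on; the FunctionId enum, memo_store map and square/cube functions below are illustrative names only, not part of any existing framework.)

#include <iostream>
#include <map>
#include <utility>

// Hypothetical stable enumeration shared by all users.
enum class FunctionId : int { Square = 0, Cube = 1 };

// (function id, input) -> already-computed output.
std::map<std::pair<FunctionId, int>, int> memo_store;

int square(int x) { return x * x; }     // stand-ins for expensive functions
int cube(int x)   { return x * x * x; }

// Return the cached result; compute and cache it on a miss.
template <typename F>
int memoized(FunctionId id, F f, int input) {
    auto key = std::make_pair(id, input);
    auto it = memo_store.find(key);
    if (it != memo_store.end())
        return it->second;              // cache hit: skip the computation
    int output = f(input);
    memo_store.emplace(key, output);    // cache miss: store (input, output)
    return output;
}

int main() {
    std::cout << memoized(FunctionId::Square, square, 4) << '\n'; // computed
    std::cout << memoized(FunctionId::Square, square, 4) << '\n'; // served from the cache
    std::cout << memoized(FunctionId::Cube, cube, 3) << '\n';
}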
Ok, maybe not what you want to hear, but consider this: since you are talking about a few functions, fewer than 50, the hash lookup should be negligible, even with collisions. Have you actually profiled and seen that the lookup is critical?
So my advice is to focus your energy on something else; most likely a perfect hash function would not bring any kind of improved performance in your case.
I am going to go one step further and say that, for fewer than 50 elements, I think a flat map (good ol' vector) would have similar performance (or maybe even better, due to cache locality). But again, measurements are required.
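For concreteness, a flat map here can be as simple as a sorted std::vector of (name, value) pairs searched with std::lower_bound. The FlatMap struct below is only an illustrative sketch (the int value stands in for "SomethingElse"), not a drop-in replacement for the unordered_map above:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// A tiny "flat map": contiguous, cache-friendly storage of key/value pairs.
struct FlatMap {
    std::vector<std::pair<std::string, int>> entries;   // kept sorted by key

    void insert(std::string key, int value) {
        entries.emplace_back(std::move(key), value);
        std::sort(entries.begin(), entries.end());       // fine for a small, static set
    }

    // Binary search over the sorted vector: O(log N) with excellent cache locality.
    const int* find(const std::string& key) const {
        auto it = std::lower_bound(entries.begin(), entries.end(), key,
            [](const auto& entry, const std::string& k) { return entry.first < k; });
        return (it != entries.end() && it->first == key) ? &it->second : nullptr;
    }
};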

Container that allows fast search and order at the same time

I keep getting into scenarios with this problem and I implement a different approach every time. Now I have decided to see whether the Stack Overflow community can suggest something better.
Let's say that I have a reconcile API, where the current set of objects in a system needs to be reevaluated - and this might take some time. (Note that obtaining the list of IDs of the objects is fast; the evaluation is slow.) It is a public API, and reconcile could be called irresponsibly. I would like to guarantee that every object in the system is reevaluated after the last call, while at the same time I do not want to reevaluate any object more than once without need. So far so good; any set, ordered or unordered, will do.
This additional requirement is the key: I would like to rotate the items, to prevent reevaluating the same objects that sit at the "top" in case of reconcile API misuse.
... or if I have "A B C D E F" in the system at the first call, I will schedule "A B C D E F" for reevaluation, in this order.
Let's say that A, B, and C have already been evaluated and there are new objects G and H in the system: the new queue should look like "D E F A B C G H", where "D E F G H A B C" would be better, but it is not critical. I do not want the queue to be "A B C D E F G H" or "D E F A B C D E F G H".
The question is: which STL or Boost container (or combination) should I use to solve this?
IMO the best approach is: if you need anything more complicated than vector, map, unordered_map, or set, then you should just default to boost::multi_index_container. The multi index container has the advantage that it is extremely powerful and flexible, can efficiently support a wide variety of lookup schemes, and is also quite easily extensible if your needs become greater later on. I would build the entire application that way first; then, if you time things and find that you need to optimize, try to replace the relevant multi index containers with optimized data structures tailored to the particular operations you need to support. This saves you an incredible amount of development time fussing over data structure decisions.
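For illustration, a minimal sketch of such a container for this reconcile queue might look like the following, assuming Boost is available (the ReconcileQueue alias and the string IDs are made up here): a sequenced index keeps the rotation order, while a hashed index answers "is this object already queued?" in O(1).

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <boost/multi_index/sequenced_index.hpp>
#include <string>

namespace bmi = boost::multi_index;

// Object IDs kept in insertion order (index 0) with O(1) membership lookup (index 1).
using ReconcileQueue = bmi::multi_index_container<
    std::string,
    bmi::indexed_by<
        bmi::sequenced<>,                               // rotation / scheduling order
        bmi::hashed_unique<bmi::identity<std::string>>  // fast "already queued?" checks
    >
>;

int main() {
    ReconcileQueue queue;
    queue.push_back("A");
    queue.push_back("B");

    // Duplicate check through the hashed index before scheduling again.
    const auto& by_id = queue.get<1>();
    if (by_id.find("C") == by_id.end())
        queue.push_back("C");

    // "Rotate" the front element to the back through the sequenced index.
    auto& seq = queue.get<0>();
    seq.relocate(seq.end(), seq.begin());
}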
A ring would be the proper data structure for this, but I don't know of any standard implementations. You can easily simulate one using a std::list, maintaining iterators to iterate, insert, and detect the end, though.
std::list<Item> ring;                 // the simulated ring
auto it = ring.begin();               // cursor: current position in the ring
ring.insert(ring.end(), item);        // O(1) insert of a new item at the tail
if (++it == ring.end())               // O(1) advance, wrapping around at the end
    it = ring.begin();
This gives O(1) insert and O(1) iteration. It adds an additional branch per iteration, but that could be eliminated with a proper ring.
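As a usage sketch of that wrap-around trick (using plain strings as the items for illustration), this is how the cursor produces the "D E F G H A B C" order from the question: newly discovered objects are appended at the tail, and evaluation simply keeps advancing from wherever it stopped.

#include <cstddef>
#include <iostream>
#include <list>
#include <string>

int main() {
    std::list<std::string> ring = {"A", "B", "C", "D", "E", "F"};
    auto cursor = ring.begin();

    // First reconcile call: evaluate A, B, C, advancing the cursor as we go.
    for (int n = 0; n < 3; ++n, ++cursor)
        std::cout << "evaluate " << *cursor << '\n';

    // New objects appear; they join at the tail, behind the not-yet-evaluated ones.
    ring.push_back("G");
    ring.push_back("H");

    // Second reconcile call: continue from the cursor and wrap around,
    // visiting D E F G H A B C -- nothing is reevaluated before the rest.
    for (std::size_t n = 0; n < ring.size(); ++n) {
        if (cursor == ring.end())
            cursor = ring.begin();
        std::cout << "evaluate " << *cursor << '\n';
        ++cursor;
    }
}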

sentiment analysis to find top 3 adjectives for products in tweets

There is a sentiment analysis tool to find out people's perceptions on social networks.
This tool can:
(1) Decompose a document into a set of sentences.
(2) Decompose each sentence into a set of words, and perform filtering such that only product names and adjectives are preserved.
e.g. "This MacBook is awesome. Sony is better than Macbook."
After processing, we can get:
{MacBook, awesome}
{Sony, better}. (not the truth :D)
We just assume there exists a list of product names, P, that we will ever care about, and there exists a list of adjectives, A, that we will ever care about.
My questions are:
Can we reduce this problem to a specialized association rule mining problem, and how? If yes, is there anything that needs to be noted, such as the reduction, parameter settings (minsup and minconf), additional constraints, and modifications to the Apriori algorithm needed to solve the problem?
Is there any way to artificially spam the result, like pushing "horrible" to the top-1 adjective? And are there any good ways to prevent this spam?
Thanks.
Have you considered counting?
For every product, count how often each adjective occurs.
Report the top-3 adjectives for each product.
Takes just one pass over your data, and does not use a lot of memory (unless you have millions of products to track).
There is no reason to use association rule mining. Association rule mining only pays off when you are looking for large itemsets (i.e. 4 or more terms) and they are equally important. If you know that one term is special (e.g. product name vs. adjectives), it makes sense to split the data set by this unique key, and then use counting.
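A minimal sketch of that counting approach (the (product, adjective) pair input and the sample data are only illustrative): one pass to build per-product counts, then a partial sort to report the top-3 adjectives per product.

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    // One (product, adjective) pair per filtered sentence, as in the example above.
    std::vector<std::pair<std::string, std::string>> mentions = {
        {"MacBook", "awesome"}, {"Sony", "better"},
        {"MacBook", "awesome"}, {"MacBook", "slow"},
    };

    // product -> (adjective -> count), built in a single pass over the data.
    std::unordered_map<std::string, std::unordered_map<std::string, int>> counts;
    for (const auto& [product, adjective] : mentions)
        ++counts[product][adjective];

    // Report the top-3 adjectives for each product, by descending count.
    for (const auto& [product, adj_counts] : counts) {
        std::vector<std::pair<std::string, int>> ranked(adj_counts.begin(), adj_counts.end());
        auto top = ranked.begin() + std::min<std::size_t>(3, ranked.size());
        std::partial_sort(ranked.begin(), top, ranked.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; });
        std::cout << product << ":";
        for (auto it = ranked.begin(); it != top; ++it)
            std::cout << ' ' << it->first << " (" << it->second << ")";
        std::cout << '\n';
    }
}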

Multi-directional hash table

I apologize in advance if this has been asked before. If it has, I have no idea what this data structure would be called.
I have a collection of N (approx ~300 or less) widgets. Each widget has M (around ~10) different names. This collection will be populated once and then looked up many times. I will be implementing this in C++.
An example of this might be a collection of 200 people and storing their names in 7 different languages.
The lookup function would basically look like this:
lookup("name", A, B), which will return the translation of the name "name" from language A to language B, (only if name is in the collection).
Is there any known data structure in the literature for doing this sort of thing efficiently? The most obvious solution is to create a bunch of hash tables for the lookups, but having MxM hash tables for all the possible pairs quickly gets unwieldy and memory inefficient. I'd also be willing to consider sorted arrays (binary search) or even trees. Since the collection is not huge, log(N) lookups are just fine.
Thank you everyone!
Based on your description of the desired lookup function, it sounds like you could use a single hash table where the key is tuple<string, Language, Language> and the value is the result of the translation. The two languages in the key identify the source language of the string and the language of the desired translation, respectively.
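A minimal sketch of that single-table idea (the Language enum, the KeyHash combiner and the sample entries are made up for illustration), assuming every (name, source, target) combination is inserted up front:

#include <cstddef>
#include <functional>
#include <string>
#include <tuple>
#include <unordered_map>

// Hypothetical language identifiers; any small integral ID would do.
enum class Language { English, French, German };

using Key = std::tuple<std::string, Language, Language>;

// A simple (not production-grade) hash combiner for the composite key.
struct KeyHash {
    std::size_t operator()(const Key& k) const {
        std::size_t h = std::hash<std::string>{}(std::get<0>(k));
        h ^= std::hash<int>{}(static_cast<int>(std::get<1>(k))) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= std::hash<int>{}(static_cast<int>(std::get<2>(k))) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

// (name, source language, target language) -> translated name.
std::unordered_map<Key, std::string, KeyHash> translations = {
    {{"apple", Language::English, Language::French}, "pomme"},
    {{"apple", Language::English, Language::German}, "Apfel"},
};

// Returns the translation, or an empty string if the name is not in the collection.
std::string lookup(const std::string& name, Language from, Language to) {
    auto it = translations.find({name, from, to});
    return it == translations.end() ? std::string{} : it->second;
}

One trade-off worth noting: this table stores one entry per (name, source, target) combination, so it needs more memory than a layout that stores each name only once per language.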
Create an N-by-M array D, such that D[u,v] is the word in language v for widget u.
Also create M hash tables, H₀...Hₘ (where m is M-1) such that Hᵥ(w).data is u if w is the word for widget u in language v.
To perform lookup(w, r, s),
(1) set u = Hᵣ(w).data
(2) if D[u,r] equals w, return D[u,s], else return not-found.
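A hedged sketch of that layout in C++ (the Dictionary struct, the language indices and the sample data are illustrative): D holds one row of M words per widget, and one unordered_map per language maps a word back to its widget index, so the total storage is O(N*M) rather than one table per language pair.

#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct Dictionary {
    // D[u][v] is the word in language v for widget u (N rows, M columns).
    std::vector<std::vector<std::string>> D;
    // H[v] maps a word in language v to its widget index u.
    std::vector<std::unordered_map<std::string, std::size_t>> H;

    explicit Dictionary(std::size_t num_languages) : H(num_languages) {}

    // Add one widget with its M names (names.size() must equal H.size()).
    void add(const std::vector<std::string>& names) {
        std::size_t u = D.size();
        D.push_back(names);
        for (std::size_t v = 0; v < names.size(); ++v)
            H[v][names[v]] = u;
    }

    // lookup(w, r, s): translate word w from language r to language s.
    std::optional<std::string> lookup(const std::string& w, std::size_t r, std::size_t s) const {
        auto it = H[r].find(w);                        // (1) u = H_r(w)
        if (it == H[r].end() || D[it->second][r] != w) // (2) verify, else not-found
            return std::nullopt;
        return D[it->second][s];                       //     return D[u,s]
    }
};

int main() {
    Dictionary dict(3);                                // languages: 0 = EN, 1 = FR, 2 = DE
    dict.add({"apple", "pomme", "Apfel"});
    dict.add({"house", "maison", "Haus"});
    auto translated = dict.lookup("pomme", 1, 2);      // FR -> DE, yields "Apfel"
}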