How do I name my model if the table describes a journal (accounting)? - ruby-on-rails-4

A model name is usually the singular name of some entity, and the table name is the plural form of that word.
For example, a Transaction is stored in the transactions table.
But there are cases when a whole table is described by a singular word that denotes a collection of entities, for example: journal, log, history.
There is no more precise name for a single row than "entry" or "item". But a model named ThingsJournalEntry looks messy, and a simple ThingsJournal is confusing, because an instance doesn't describe an actual journal, only a single entry.
Is there a common naming approach for such cases that is better than those described above?

Your question seems to show that there are actually two naming issues at play. One regards the elements of your collection, which you ask about explicitly. The other regards the collection itself, which is rather implicit. In fact, when you refer to Journal you feel compelled to clarify (accounting). This means that your collection class would be better named AccountingJournal, which would remove the ambiguity.
Now, since the description you provide of these objects (collection and elements) is a little succinct, I don't have enough information to suggest an appropriate name. However, to give you some hints, I would recommend considering not just the nature of the elements but the responsibility they will have in your model. Will they represent entities or actions? If the elements are entities (things), consider simple nouns that replicate or are familiar from the language used by accountants. Examples include AccountingEntry or AccountingRecord. If your elements represent actions, then use a suffix that stresses that characteristic, for example AccountingAnnotation or AccountingRegistration. Another question you can ask yourself is what kind of messages these elements will receive. For instance, if they will represent additions and subtractions of money, you might want to use AccountingOperation or AccountChange.
Whatever the situation dictated by the domain, you should verify that your names sound like sentences said by actual domain experts: "[this is an] accounting record of [something]" or "add this accounting record to the journal" or "register this accounting operation in the journal" or even "this is an accounting operation for subtracting this amount of money," etc.
The (intellectual) activity of naming your objects is directly connected with the activity of shaping your model. Exercise your prospective model in your head by saying aloud the messages that your objects would typically receive, and check that the language you are shaping closely reproduces the language of the domain. In other words, let your objects talk and hear what they say; they will tell you their names.

Related

Partition key consisting of multiple values in DynamoDB: standard attribute name and value formatting?

I am creating a DDB table whose partition key and sort key are each made up of multiple values. The primary key is a composite of the partition key and sort key.
The partition key would be something like region+date+location and the sort key would be zone+update timestamp millis.
What's the norm for naming these attributes? Is it just naming out the values like region+date+location? Or some other kind of delimitation? I've also read that it might be better to be generic and just name them something like partitionKey and rangeKey or <typeofthing>id etc., but I've gotten a little pushback on this from my team because the names aren't helpful in that case.
I can't seem to find best practices for this specific question anywhere. Is there a preferred approach for this written down somewhere that I could point to?
There is no "standard" way of naming attributes. But there are two things to consider against mandating a naming standard like region+date+location:
A very long attribute name is wasteful - you need to send it over the network when writing and reading items, and it is included in the item length for which you pay per operation and for storage. I'm not saying you should name your attributes "a" and "b", but try not to go overboard in the other direction either.
An attribute name like region+date+location implies that this attribute contains only this combination, and will forever contain only this combination. But often in DynamoDB the same attribute name is reused for multiple different types - this reuse is the hallmark of "single-table design". That being said, these counterexamples aren't too relevant to your use case, because the attributes overloaded in this way are usually not the key columns, as they are in your use case.
In your case I think that whatever you decide will be fine. There is no compelling reason to stay away from one of the options you mentioned.
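If you do go with one attribute holding several values, a common convention (an assumption here, not something from the question) is to join the parts with a delimiter such as "#" so the individual values stay recoverable. A minimal C++ sketch, with illustrative names:

#include <iostream>
#include <string>

// Hypothetical helper: build a composite DynamoDB key value by joining
// the parts with a '#' delimiter so they can be split back apart later.
std::string make_partition_key(const std::string& region,
                               const std::string& date,
                               const std::string& location) {
    return region + "#" + date + "#" + location;
}

int main() {
    // The attribute itself can then carry a short generic name like "pk".
    std::cout << make_partition_key("us-east-1", "2023-01-15", "site-7")
              << "\n";  // prints: us-east-1#2023-01-15#site-7
}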

What's the correct normalization of a relationship of three tables?

I have three tables:
teachers
classes
courses
The sentences are:
A teacher may teach one or more courses.
A teacher may teach one or more classes.
A teacher teaches a course for a class.
So I need a fourth table where the primary keys of the three tables compose the PRIMARY KEY or a UNIQUE INDEX of the fourth table.
What is the correct normalization for this?
The name of the table: is "class_course_teacher" OK, or should I use a name like "assignments" for this?
The primary key of the table: is "class_id + course_id + teacher_id" OK, or should I create an "id" for the assignments table and make "class_id + course_id + teacher_id" a unique index?
Normalization starts with functional dependencies. It's a method of breaking a set of information into elementary facts without losing information. Think of it as logical refactoring.
Your sentences are a start but not sufficient to determine the dependencies. Do courses have one or more teachers? Do classes have one or more teachers? Do courses have one or more classes? Do classes belong to one or more courses? Can teachers "teach" courses without classes (i.e. do you want to record which teachers can teach a course before any classes are assigned)? Do you want to define classes or courses before assigning teachers?
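For example (an illustrative assumption, since only your answers to the questions above decide which case applies): suppose each (class, course) pair is taught by exactly one teacher. That is a functional dependency

class_id, course_id -> teacher_id

and the fourth table would then have (class_id, course_id) as its key, with teacher_id as a dependent attribute rather than part of the key. If instead the same class/course pair can have several teachers, no such dependency holds and all three columns belong in the key.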
Your two questions don't relate to normalization. assignments is a decent name, provided you won't be recording other assignments (like homework assignments), in which case teacher_assignments or class_assignments might be better. class_course_teacher could imply that there can only be one relationship involving those three entity sets, but it can and does happen that different relationships involve the same entity sets.
I advise against using surrogate ids until you have a reason to use them. They increase the number of columns and indices required without adding useful information, and can increase the number of tables that need to be joined in a query (since you need to "dereference" the assignment_id to get to the class_id, course_id and teacher_id) unless you record redundant data (which has its own problems).
Normalization is about data structures - not about naming. If the only requirement is "correct normalization", then both decisions are up to you.
Nevertheless, good names are important in the real world. I like "assignments" - it is very meaningful.
I like to create ID columns for each table; they make it easier to change relationships between tables afterwards.

sentiment analysis to find top 3 adjectives for products in tweets

There is a sentiment analysis tool to find out people's perceptions on social networks.
This tool can:
(1) Decompose a document into a set of sentences.
(2) Decompose each sentence into a set of words, and perform filtering so that only product names and adjectives are preserved.
e.g. "This MacBook is awesome. Sony is better than Macbook."
After processing, we can get:
{MacBook, awesome}
{Sony, better}. (not the truth :D)
We just assume there exists a list of product names, P, that we will ever care about, and a list of adjectives, A, that we will ever care about.
My questions are:
Can we reduce this problem to a specialized association rule mining problem, and how? If yes, is there anything to watch for, such as the reduction itself, parameter settings (minsup and minconf), additional constraints, and modifications to the Apriori algorithm needed to solve the problem?
Is there any way to artificially spam the result, like pushing "horrible" to the top-1 adjective? And are there good ways to prevent this spam?
Thanks.
Have you considered counting?
For every product, count how often each adjective occurs.
Report the top-3 adjectives for each product.
Takes just one pass over your data, and does not use a lot of memory (unless you have millions of products to track).
There is no reason to use association rule mining. Association rule mining only pays off when you are looking for large itemsets (i.e. 4 or more terms) and they are equally important. If you know that one term is special (e.g. product name vs. adjectives), it makes sense to split the data set by this unique key, and then use counting.
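For illustration, a minimal C++ sketch of the counting approach (the input pairs are made up; in practice they would come from the decomposition step):

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    // (product, adjective) pairs as produced by the filtering step.
    std::vector<std::pair<std::string, std::string>> mentions = {
        {"MacBook", "awesome"}, {"Sony", "better"},
        {"MacBook", "awesome"}, {"MacBook", "slow"},
    };

    // One pass: product -> (adjective -> count).
    std::unordered_map<std::string,
                       std::unordered_map<std::string, int>> counts;
    for (const auto& [product, adjective] : mentions)
        ++counts[product][adjective];

    // Report the top-3 adjectives per product, highest count first.
    for (const auto& [product, adj_counts] : counts) {
        std::vector<std::pair<std::string, int>> ranked(adj_counts.begin(),
                                                        adj_counts.end());
        auto top = ranked.begin()
                 + static_cast<long>(std::min<std::size_t>(3, ranked.size()));
        std::partial_sort(ranked.begin(), top, ranked.end(),
                          [](const auto& a, const auto& b) {
                              return a.second > b.second;
                          });
        std::cout << product << ":";
        for (auto it = ranked.begin(); it != top; ++it)
            std::cout << " " << it->first << " (" << it->second << ")";
        std::cout << "\n";
    }
}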

Designing a REST hierarchy where there is duplicate data

We are having a debate on how to design REST endpoints. It basically comes down to this contrived example.
Say we have:
/netflix/movie/1/actors <- returns actors A, B and C
/netflix/movie/2/actors <- returns actors A, D, and E
Where the actor A is the same actor.
Now, to get the biography of the actor, which is "better" (yes, a judgement call):
/netflix/movie/1/actors/A
/netflix/movie/2/actors/A
or:
/actors/A
The disagreement ultimately stems from using Ember.js, which expects a certain hierarchy, versus the desire not to have multiple ways to access the same data (in the end it would truly be a small amount of code duplication). It is possible to map Ember.js to use /actors/A, so there is no strict technical limitation; this is really more of a philosophical question.
I have looked around and I cannot find any solid advice on this sort of thing.
I faced the same problem and went for option 2 (one "canonical" URI per resource) for the sake of simplicity and soundness (one type of resource per root).
Otherwise, when do you stop? Consider:
/actors/
/actors/A
/actors/A/movies
/actors/A/movies/1
/actors/A/movies/1/actors
/actors/A/movies/1/actors/B
...
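For what it's worth, one way to keep the nested listing while still having a single canonical URI per actor (an illustrative sketch, not from the original question) is to have /netflix/movie/1/actors return links to the canonical resources:

{
  "movie": 1,
  "actors": [
    { "name": "A", "bio": "/actors/A" },
    { "name": "B", "bio": "/actors/B" }
  ]
}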
I would, from an outsider's perspective, expect movies/1/actors/A to return information specific to that actor FOR that movie, whereas I would expect /actors/A to return information on that actor in general.
By analogy, I would expect projects/1/tasks/1/comments to return comments specific to the task - the highest level of the relationship via its URL.
I would expect projects/1/comments to return comments related to the project itself, or to aggregate all comments from the project.
The analogy isn't specific to the data in question, but I think it illustrates the point of url hierarchy leading to certain expectations about the data returned.
I would in this case clearly prefer /actors/A.
My reasoning is that /movie/1/actors reports a list. This list, being a 1-n mapping between a movie and its actors, is not meant to be a path with further nodes. One simply does not expect to find actors in the movie tree.
You might one day implement /actors/A/movies returning 1 & 2, and this would make you implement URLs like /actors/A/movies/2 - and here you get recursion: movie/actor/movie/actor.
I'd prefer one single URL per object, and one clear spot where the 1-n mapping can be found.

Design for customizable string filter

Suppose I have tons of filenames in my_dir/my_subdir, formatted in some way:
data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00
data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00
data11_7TeV.00178109.physics_Egamma.merge.D2AOD_DIPHO.f351_m765_p539_p540_tid312017_00
For example, data11_7TeV is the data_type, 00179691 the run number, and NTUP_PHOTON the data format.
I want to write an interface to do something like this:
dataset = DataManager("my_dir/my_subdir").filter_type("data11_7TeV").filter_run("> 00179691").filter_tag("m = 796");
// don't do the filtering, be lazy
cout << dataset.count(); // count is an action, do the filtering
vector<string> dataset_list = dataset.get_list(); // don't repeat the filtering
dataset.save_filter("file.txt", "ALIAS"); // save the filter (not the filenames), for example save the regex
dataset2 = DataManagerAlias("file.txt", "ALIAS"); // get the saved filter
cout << dataset2.filter_tag("p = 123").count();
I want lazy behaviour: for example, no real filtering should be done before an action like count or get_list is called. And I don't want to redo the filtering if it has already been done.
I'm just learning about design patterns, and I think I can use:
an abstract base class AbstractFilter that implements the filter* methods
a factory that decides from the called method which decorator to use
every time I call a filter* method I return a decorated class, for example:
AbstractFilter* AbstractFilter::filter_run(string arg) {
    // if arg is "> 00179691" the factory returns a FilterRunGreater("00179691")
    auto decorator = factory.get_decorator_run(arg);
    return decorator(this);  // wrap the current filter in the decorator
}
a proxy that builds a regex to filter the filenames, but doesn't do the filtering
I'm also learning jQuery and I'm using a similar chaining mechanism.
Can someone give me some hints? Is there some place where a design like this is explained? The design must be very flexible, in particular to handle new formats in the filenames.
I believe you're over-complicating the design-pattern aspect and glossing over the underlying matching/indexing issues. Getting the full directory listing from disk can be expected to be orders of magnitude more expensive than the in-RAM filtering of filenames it returns, and the former needs to have completed before you can do a count() or get_list() on any dataset (though you could come up with some lazier iterator operations over the dataset).
As presented, the real functional challenge could be in indexing the filenames so you can repeatedly find the matches quickly. But, even that's unlikely as you presumably proceed from getting the dataset of filenames to actually opening those files, which is again orders of magnitude slower. So, optimisation of the indexing may not make any appreciable impact to your overall program's performance.
But, let's say you read all the matching directory entries into an array A.
Now, for filtering, it seems your requirements can generally be met using std::multimap find(), lower_bound() and upper_bound(). The most general way to approach it is to have separate multimaps for data type, run number, data format, p value, m value, tid etc. that map to a list of indices in A. You can then use existing STL algorithms to find the indices that are common to the results of your individual filters.
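As an illustration of that multimap idea, here is a minimal C++17 sketch; the extracted property values and index contents are assumptions, hard-coded here instead of parsed from the filenames:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

int main() {
    // A: the directory listing read into memory (two entries for brevity).
    std::vector<std::string> A = {
        "data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00",
        "data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00",
    };

    // One index per extracted property, each mapping value -> index into A.
    std::multimap<std::string, std::size_t> by_type;  // data type
    std::multimap<int, std::size_t> by_run;           // run number
    by_type.insert({"data11_7TeV", 0});
    by_type.insert({"data11_7TeV", 1});
    by_run.insert({179691, 0});
    by_run.insert({180400, 1});

    // Filter: data_type == "data11_7TeV" AND run > 179691.
    std::vector<std::size_t> type_hits, run_hits, result;
    auto [tb, te] = by_type.equal_range("data11_7TeV");
    for (auto it = tb; it != te; ++it) type_hits.push_back(it->second);
    for (auto it = by_run.upper_bound(179691); it != by_run.end(); ++it)
        run_hits.push_back(it->second);

    // Combine the per-filter hits; set_intersection needs sorted ranges.
    std::sort(type_hits.begin(), type_hits.end());
    std::sort(run_hits.begin(), run_hits.end());
    std::set_intersection(type_hits.begin(), type_hits.end(),
                          run_hits.begin(), run_hits.end(),
                          std::back_inserter(result));

    for (std::size_t i : result) std::cout << A[i] << "\n";  // second entry
}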
There are a lot of optimisations possible if you happen to have unstated insights or restrictions regarding your data and filtering needs (which is very likely). For example:
if you know a particular filter will always be used, and immediately cuts the potential matches down to a manageable number (e.g. < ~100), then you could use it first and resort to brute force searches for subsequent filtering.
Another possibility is to extract properties of individual filenames into a structure: std::string data_type; std::vector<int> p; etc., then write an expression evaluator supporting predicates like "p includes 924 and data_type == 'XYZ'" (see the sketch below), though by itself that lends itself to brute-force comparisons rather than faster index-based matching.
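For concreteness, a minimal sketch of that structure with one hard-coded brute-force predicate (the field names and example data are assumptions, not from the original post):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct FileProps {
    std::string name;       // the full filename
    std::string data_type;  // e.g. "data11_7TeV"
    int run;                // e.g. 178109
    std::vector<int> p;     // all p-tags, e.g. {539, 540}
};

// Brute-force evaluation of: data_type == "data11_7TeV" and p includes 540.
std::vector<FileProps> filter(const std::vector<FileProps>& files) {
    std::vector<FileProps> out;
    std::copy_if(files.begin(), files.end(), std::back_inserter(out),
                 [](const FileProps& f) {
                     return f.data_type == "data11_7TeV" &&
                            std::find(f.p.begin(), f.p.end(), 540) != f.p.end();
                 });
    return out;
}

int main() {
    std::vector<FileProps> files = {
        {"some_file_name", "data11_7TeV", 178109, {539, 540}},
    };
    std::cout << filter(files).size() << "\n";  // prints 1: both clauses match
}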
I know you said you don't want to use external libraries, but an in-memory database and SQL-like query ability may save you a lot of grief if your needs really are at the more elaborate end of the spectrum.
I would use a strategy pattern. Your DataManager constructs a DataSet type, and the DataSet has a FilteringPolicy assigned. The default can be a NullFilteringPolicy, which means no filters. If the DataSet member function filter_type(string t) is called, it swaps out the filter policy class with a new one, which can be factory-constructed from the filter_type parameter. Methods like filter_run() can be used to add filtering conditions onto the FilterPolicy. In the NullFilterPolicy case they are just no-ops. This seems straightforward to me; I hope this helps.
EDIT:
To address the method chaining you simply need to return *this, i.e. return a reference to the DataSet class. This means you can chain DataSet methods together. It's what the C++ iostream libraries do when you implement operator>> or operator<<.
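A minimal sketch of how these pieces could fit together, simplified to a single active policy (all class and method names here are illustrative, a sketch rather than a definitive implementation):

#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct FilterPolicy {                     // the strategy interface
    virtual ~FilterPolicy() = default;
    virtual bool matches(const std::string& name) const = 0;
};

struct NullFilterPolicy : FilterPolicy {  // default: everything matches
    bool matches(const std::string&) const override { return true; }
};

struct TypeFilterPolicy : FilterPolicy {  // keep names with a given prefix
    explicit TypeFilterPolicy(std::string t) : type(std::move(t)) {}
    bool matches(const std::string& name) const override {
        return name.rfind(type, 0) == 0;  // true if name starts with type
    }
    std::string type;
};

class DataSet {
public:
    explicit DataSet(std::vector<std::string> names)
        : names_(std::move(names)),
          policy_(std::make_unique<NullFilterPolicy>()) {}

    // Swapping the policy and returning *this enables method chaining.
    DataSet& filter_type(const std::string& t) {
        policy_ = std::make_unique<TypeFilterPolicy>(t);
        return *this;
    }

    // The "action": the filter is only applied here, keeping things lazy.
    std::size_t count() const {
        std::size_t n = 0;
        for (const auto& name : names_)
            if (policy_->matches(name)) ++n;
        return n;
    }

private:
    std::vector<std::string> names_;
    std::unique_ptr<FilterPolicy> policy_;
};

int main() {
    DataSet ds({"data11_7TeV.00179691.x", "mc12_8TeV.00123456.x"});
    std::cout << ds.filter_type("data11_7TeV").count() << "\n";  // prints 1
}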
First of all, I think that your design is pretty smart and lends itself well to the kind of behavior you are trying to model.
Anyway, my understanding is that you are trying to build a sort of "Domain Specific Language" (DSL), whereby you can chain "verbs" (the various filtering methods) representing actions on, or connecting, "entities" (where the variability is represented by the different naming formats that could exist, although you do not say anything about this).
In this respect, a very interesting discussion is found in Martin Fowler's book "Domain Specific Languages". Just to give you a taste of what it is about, here you can find an interesting discussion of the "Method Chaining" pattern, defined as:
“Make modifier methods return the host object so that multiple modifiers can be invoked in a single expression.”
As you can see, this pattern describes the very chaining mechanism you are positing in your design.
Here you have a list of all the patterns that were found useful in defining such DSLs. Again, you will easily find there several specialized patterns that you are also implying in your design or describing by way of more generic patterns (like the decorator). A few of them are: Regex Table Lexer, Method Chaining, Expression Builder, etc. And many more that could help you further specify your design.
All in all, I could add my two cents by saying that I see a place for a "command processor" pattern in your specification, but I am pretty confident that by deploying the powerful abstractions that Fowler proposes you will be able to come up with a much more specific and precise design, covering aspects of the problem that right now are simply hidden by the generality of the GoF pattern set.
It is true that this could be overkill for a problem like the one you are describing, but as an exercise in pattern-oriented design it can be very insightful.
I'd suggest starting with the Boost iterator library - e.g. the filter iterator.
(And, of course, Boost includes a very nice regex library.)