Design for a customizable string filter - C++

Suppose I have tons of filenames in my_dir/my_subdir, formatted in some way:
data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00
data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00
data11_7TeV.00178109.physics_Egamma.merge.D2AOD_DIPHO.f351_m765_p539_p540_tid312017_00
For example, data11_7TeV is the data_type, 00179691 the run number, and NTUP_PHOTON the data format.
I want to write an interface to do something like this:
dataset = DataManager("my_dir/my_subdir").filter_type("data11_7TeV").filter_run("> 00179691").filter_tag("m = 796");
// don't do the filtering, be lazy
cout << dataset.count(); // count is an action, do the filtering
vector<string> dataset_list = dataset.get_list(); // don't repeat the filtering
dataset.save_filter("file.txt", "ALIAS"); // save the filter (not the filenames), for example save the regex
dataset2 = DataManagerAlias("file.txt", "ALIAS"); // get the saved filter
cout << dataset2.filter_tag("p = 123").count();
I want lazy behaviour: for example, no real filtering should be done before an action like count or get_list. I also don't want to redo the filtering if it has already been done.
I'm just learning about design patterns, and I think I can use:
an abstract base class AbstractFilter that implements the filter* methods
a factory to decide, from the called method and its argument, which decorator to use
every time I call a filter* method I return a decorated object, for example:
AbstractFilter* AbstractFilter::filter_run(string arg) {
    // if arg is "> 00179691" the factory returns a FilterRunGreater(00179691)
    auto decorator = factory.get_decorator_run(arg);
    return decorator(this); // wrap this filter in the decorator
}
a proxy that builds a regex to filter the filenames, but doesn't do the filtering
I'm also learning jQuery, which uses a similar chaining mechanism.
Can someone give me some hints? Is there some place where a design like this is explained? The design must be very flexible, in particular to handle new formats in the filenames.

I believe you're over-complicating the design-pattern aspect and glossing over the underlying matching/indexing issues. Getting the full directory listing from disk can be expected to be orders of magnitude more expensive than the in-RAM filtering of filenames it returns, and the former needs to have completed before you can do a count() or get_list() on any dataset (though you could come up with some lazier iterator operations over the dataset).
As presented, the real functional challenge could be in indexing the filenames so you can repeatedly find the matches quickly. But, even that's unlikely as you presumably proceed from getting the dataset of filenames to actually opening those files, which is again orders of magnitude slower. So, optimisation of the indexing may not make any appreciable impact to your overall program's performance.
But, let's say you read all the matching directory entries into an array A.
Now, for filtering, it seems your requirements can generally be met using std::multimap's find(), lower_bound() and upper_bound(). The most general approach is to have separate multimaps for data type, run number, data format, p value, m value, tid etc., each mapping to a list of indices into A. You can then use existing STL algorithms to find the indices that are common to the results of your individual filters.
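For illustration, a minimal sketch of that multimap-per-attribute indexing, with the parsed values hard-coded (in a real program they would be extracted from the filenames; all names are mine):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

int main() {
    // A: the directory listing (abbreviated here)
    std::vector<std::string> A = {"f1", "f2", "f3"};

    // one multimap per attribute, mapping value -> index into A
    std::multimap<std::string, std::size_t> by_type;
    std::multimap<int, std::size_t> by_run;
    by_type.insert({"data11_7TeV", 0});
    by_type.insert({"data12_8TeV", 1});
    by_type.insert({"data11_7TeV", 2});
    by_run.insert({179691, 0});
    by_run.insert({200000, 1});
    by_run.insert({180400, 2});

    // filter_type("data11_7TeV"): all indices with that exact key
    std::vector<std::size_t> t;
    auto r = by_type.equal_range("data11_7TeV");
    for (auto it = r.first; it != r.second; ++it) t.push_back(it->second);
    std::sort(t.begin(), t.end());

    // filter_run("> 179691"): everything strictly above the bound
    std::vector<std::size_t> ru;
    for (auto it = by_run.upper_bound(179691); it != by_run.end(); ++it)
        ru.push_back(it->second);
    std::sort(ru.begin(), ru.end());

    // indices satisfying both filters
    std::vector<std::size_t> both;
    std::set_intersection(t.begin(), t.end(), ru.begin(), ru.end(),
                          std::back_inserter(both));
    for (std::size_t i : both) std::cout << A[i] << "\n";  // prints "f3"
}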
There are a lot of optimisations possible if you happen to have unstated insights / restrictions regarding your data and filtering needs (which is very likely). For example:
if you know a particular filter will always be used, and immediately cuts the potential matches down to a manageable number (e.g. < ~100), then you could use it first and resort to brute force searches for subsequent filtering.
Another possibility is to extract properties of individual filenames into a structure: std::string data_type; std::vector<int> p; etc., then write an expression evaluator supporting predicates like "p includes 924 and data_type == 'XYZ'", though by itself that lends itself to brute-force comparisons rather than faster index-based matching.
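For instance, a rough sketch of that property-struct extraction with brute-force predicate filtering (the struct layout and the hard-coded predicate are purely illustrative):

#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

struct FileProps {
    std::string name;        // the full filename
    std::string data_type;   // e.g. "data11_7TeV"
    int run = 0;             // e.g. 179691
    std::vector<int> p;      // e.g. {541}
};

// keep only entries whose data_type matches and whose p-list includes 541
std::vector<FileProps> filter(const std::vector<FileProps>& all) {
    std::vector<FileProps> out;
    std::copy_if(all.begin(), all.end(), std::back_inserter(out),
                 [](const FileProps& f) {
                     return f.data_type == "data11_7TeV" &&
                            std::find(f.p.begin(), f.p.end(), 541) != f.p.end();
                 });
    return out;
}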
I know you said you don't want to use external libraries, but an in-memory database and SQL-like query ability may save you a lot of grief if your needs really are at the more elaborate end of the spectrum.

I would use a strategy pattern. Your DataManager constructs a DataSet type, and the DataSet has a FilteringPolicy assigned. The default can be a NullFilteringPolicy, which means no filters. If the DataSet member function filter_type(string t) is called, it swaps out the filter policy class with a new one, which can be factory-constructed via the filter_type param. Methods like filter_run() can be used to add filtering conditions onto the FilterPolicy. In the NullFilterPolicy case they're just no-ops. This seems straightforward to me; I hope this helps.
EDIT:
To address the method chaining, you simply need to return *this, i.e. return a reference to the DataSet class. This means you can chain DataSet methods together. It's what the C++ iostream libraries do when you implement operator>> or operator<<.
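For illustration, a minimal sketch of how this might fit together, with the filtering deferred until count() and chaining via return *this (the concrete policy and everything beyond the names used in this answer are assumptions, not a definitive implementation):

#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct FilteringPolicy {
    virtual ~FilteringPolicy() = default;
    virtual bool matches(const std::string& filename) const = 0;
};

struct NullFilteringPolicy : FilteringPolicy {     // default: no filters
    bool matches(const std::string&) const override { return true; }
};

struct TypeFilteringPolicy : FilteringPolicy {     // illustrative concrete policy
    std::string type;
    explicit TypeFilteringPolicy(std::string t) : type(std::move(t)) {}
    bool matches(const std::string& f) const override {
        return f.rfind(type, 0) == 0;              // filename starts with type
    }
};

class DataSet {
public:
    explicit DataSet(std::vector<std::string> files)
        : files_(std::move(files)), policy_(new NullFilteringPolicy) {}

    DataSet& filter_type(const std::string& t) {   // swap the policy, then chain
        policy_.reset(new TypeFilteringPolicy(t));
        return *this;
    }
    std::size_t count() const {                    // action: filtering happens here
        std::size_t n = 0;
        for (const auto& f : files_)
            if (policy_->matches(f)) ++n;
        return n;
    }
private:
    std::vector<std::string> files_;
    std::unique_ptr<FilteringPolicy> policy_;
};

int main() {
    DataSet ds({"data11_7TeV.00179691...", "data12_8TeV.00200000..."});
    std::cout << ds.filter_type("data11_7TeV").count() << "\n";  // prints 1
}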

First of all, I think that your design is pretty smart and lends itself well to the kind of behavior you are trying to model.
Anyway, my understanding is that you are trying to build a sort of "Domain Specific Language", whereby you can chain "verbs" (the various filtering methods) representing actions on, or connections between, "entities" (where the variability is represented by the different naming formats that could exist, although you do not say anything about this).
In this respect, a very interesting discussion is found in Martin Fowler's book "Domain Specific Languages". Just to give you a taste of what it is about, here you can find an interesting discussion about the "Method Chaining" pattern, defined as:
“Make modifier methods return the host object so that multiple modifiers can be invoked in a single expression.”
As you can see, this pattern describes the very chaining mechanism you are positing in your design.
Here you have a list of all the patterns that were found interesting in defining such DSLs. Again, you will easily find there several specialized patterns that you are also implying in your design or describing by way of more generic patterns (like the decorator). A few of them are: Regex Table Lexer, Method Chaining, Expression Builder, etc. And many more that could help you further specify your design.
All in all, I could add my two cents by saying that I see a place for a "command processor" pattern in your specification, but I am pretty confident that by deploying the powerful abstractions that Fowler proposes you will be able to come up with a much more specific and precise design, covering aspects of the problem that right now are simply hidden by the "generality" of the GoF pattern set.
It is true that this could be "overkill" for a problem like the one you are describing, but as an exercise in pattern oriented design it can be very insightful.

I'd suggest starting with the Boost iterator library - e.g. the filter iterator.
(And, of course, boost includes a very nice regex library.)
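For example, a small sketch with boost::filter_iterator (assuming Boost is available; the prefix predicate is just an illustration):

#include <boost/iterator/filter_iterator.hpp>
#include <iostream>
#include <string>
#include <vector>

struct StartsWith {
    std::string prefix;
    bool operator()(const std::string& s) const {
        return s.compare(0, prefix.size(), prefix) == 0;
    }
};

int main() {
    std::vector<std::string> files = {"data11_7TeV.00179691...",
                                      "data12_8TeV.00200000..."};
    StartsWith pred{"data11_7TeV"};
    auto first = boost::make_filter_iterator(pred, files.begin(), files.end());
    auto last  = boost::make_filter_iterator(pred, files.end(), files.end());
    for (; first != last; ++first)
        std::cout << *first << "\n";   // only the data11_7TeV entries
}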

Related

sentiment analysis to find top 3 adjectives for products in tweets

There is a sentiment analysis tool to find out people's perceptions on social networks.
This tool can:
(1) Decompose a document into a set of sentences.
(2) Decompose each sentence into a set of words, and perform filtering such that only product names and adjectives are preserved.
e.g. "This MacBook is awesome. Sony is better than Macbook."
After processing, we can get:
{MacBook, awesome}
{Sony, better}. (not the truth :D)
We just assume there exists a list of product names, P, that we will ever care about, and a list of adjectives, A, that we will ever care about.
My questions are:
Can we reduce this problem to a specialized association rule mining problem, and how? If yes, is there anything to watch out for, such as the reduction itself, parameter settings (minsup and minconf), additional constraints, and modifications to the Apriori algorithm to solve the problem?
Is there any way to artificially spam the result, like pushing "horrible" to the top adjective? And are there any good ways to prevent this spam?
Thanks.
Have you considered counting?
For every product, count how often each adjective occurs.
Report the top-3 adjectives for each product.
Takes just one pass over your data, and does not use a lot of memory (unless you have millions of products to track).
There is no reason to use association rule mining. Association rule mining only pays off when you are looking for large itemsets (i.e. 4 or more terms) and they are equally important. If you know that one term is special (e.g. product name vs. adjectives), it makes sense to split the data set by this unique key, and then use counting.
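Since the surrounding questions are in C++, here is a minimal sketch of that counting approach (the (product, adjective) pairs are assumed to come from the extraction step described above):

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<std::string, std::string>> extracted = {
        {"MacBook", "awesome"}, {"Sony", "better"}, {"MacBook", "awesome"},
        {"MacBook", "slow"}};

    // one pass: count how often each adjective occurs per product
    std::map<std::string, std::map<std::string, int>> counts;
    for (const auto& e : extracted) ++counts[e.first][e.second];

    // report the top-3 adjectives for each product
    for (const auto& prod : counts) {
        std::vector<std::pair<int, std::string>> ranked;
        for (const auto& adj : prod.second)
            ranked.push_back({adj.second, adj.first});
        std::sort(ranked.rbegin(), ranked.rend());   // descending by count
        std::cout << prod.first << ":";
        for (std::size_t i = 0; i < ranked.size() && i < 3; ++i)
            std::cout << " " << ranked[i].second;
        std::cout << "\n";
    }
}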

Designing a REST hierarchy where there is duplicate data

We are having a debate on how to design REST endpoints. It basically comes down to this contrived example.
Say we have:
/netflix/movie/1/actors <- returns actors A, B and C
/netflix/movie/2/actors <- returns actors A, D, and E
Where the actor A is the same actor.
Now to get the biography of the actor which is "better" (yes, a judgement call):
/netflix/movie/1/actors/A
/netflix/movie/2/actors/A
or:
/actors/A
The disagreement ultimately stems from using Ember.js, which expects a certain hierarchy, versus the desire not to have multiple ways to access the same data (in the end it would truly be a small amount of code duplication). It is possible to map Ember.js to use /actors/A, so there is no strict technical limitation; this is really more of a philosophical question.
I have looked around and I cannot find any solid advice on this sort of thing.
I faced the same problem and went for option 2 (one "canonical" URI per resource) for the sake of simplicity and soundness (one type of resource per root).
Otherwise, when do you stop? Consider:
/actors/
/actors/A
/actors/A/movies
/actors/A/movies/1
/actors/A/movies/1/actors
/actors/A/movies/1/actors/B
...
I would, from an outsider's perspective, expect movies/1/actors/A to return information specific to that actor FOR that movie, whereas I would expect /actors/A to return information on that actor in general.
By analogy, I would expect projects/1/tasks/1/comments to return comments specific to the task - the highest level of the relationship via its URL.
I would expect projects/1/comments to return comments related to the lower-level project, or to aggregate all comments from the project.
The analogy isn't specific to the data in question, but I think it illustrates how a URL hierarchy leads to certain expectations about the data returned.
I would in this case clearly prefer /actors/A.
My reasoning is that /movie/1/actors returns a list. This list, being a 1-n mapping between movie and actors, is not meant to be a path with further nodes. One simply does not expect to find actors in the movie tree.
You might one day implement /actors/A/movies returning 1 & 2, and this would make you implement URLs like /actors/A/movies/2 - and here you get recursion: movie/actor/movie/actor.
I'd prefer one single URL per object, and one clear spot where the 1-n mapping can be found.

Mapping vectors of arbitrary type

I need to store a list of vectors of different types, each to be referenced by a string identifier. For now, I'm using std::map with std::string as the key and boost::any as its value (example implementation posted here).
I've come unstuck when trying to run a method on all the stored vectors, e.g.:
std::map<std::string, boost::any>::iterator it;
for (it = map_.begin(); it != map_.end(); ++it) {
    it->second.reserve(100); // FAIL: refers to boost::any, not std::vector
}
My questions:
Is it possible to cast boost::any to an arbitrary vector type so I can execute its methods?
Is there a better way to map vectors of arbitrary types and retrieve them later on with the correct type?
At present, I'm toying with an alternative implementation which replaces boost::any with a pointer to a base container class, as suggested in this answer. This opens up a whole new can of worms with other issues I need to work out. I'm happy to go down this route if necessary, but I'm still interested to know if I can make it work with boost::any, or if there are other, better solutions.
P.S. I'm a C++ n00b (and have been spoilt silly by Python's dynamic typing for far too long), so I may well be going about this the wrong way. Harsh criticism (ideally followed by suggestions) is very welcome.
The big picture:
As pointed out in comments, this may well be an XY problem so here's an overview of what I'm trying to achieve.
I'm writing a task scheduler for a simulation framework that manages the execution of tasks; each task is an elemental operation on a set of data vectors. For example, if task_A is defined in the model to be an operation on "x"(double), "y"(double), "scale"(int) then what we're effectively trying to emulate is the execution of task_A(double x[i], double y[i], int scale[i]) for all values of i.
Every task (function) operates on a different subset of the data, so these functions share a common function signature and only have access to data via specific APIs, e.g. get_int("scale") and set_double("x", 0.2).
In a previous incarnation of the framework (written in C), tasks were scheduled statically and the framework generated code based on a given model to run the simulation. The ordering of tasks is based on a dependency graph extracted from the model definition.
We're now attempting to create a common runtime for all models, with a run-time scheduler that executes tasks as their dependencies are met. The move from generating model-specific code to a generic runtime has brought about all sorts of pain. Essentially, I need to be able to generically handle heterogeneous vectors and access them by "name" (and perhaps type_info), hence the above question.
I'm open to suggestions. Any suggestion.
Looking through the added detail, my immediate reaction would be to separate the data out into a number of separate maps, with the type as a template parameter. For example, you'd replace get_int("scale") with get<int>("scale") and set_double("x", 0.2) with set<double>("x", 0.2).
Alternatively, using std::map, you could pretty easily change that (for one example) to something like doubles["x"] = 0.2; or int scale_factor = ints["scale"]; (though you may need to be a bit wary with the latter -- if you try to retrieve a nonexistent value, it'll create it with default initialization rather than signaling an error).
Either way, you end up with a number of separate collections, each of which is homogeneous, instead of trying to put a number of collections of different types together into one big collection.
If you really do need to put those together into a single overall collection, I'd think hard about just using a struct, so it would become something like vals.doubles["x"] = 0.2; or int scale_factor = vals.ints["scale"];
At least offhand, I don't see this losing much of anything, and by retaining static typing throughout, it certainly seems to fit better with how C++ is intended to work.
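For illustration, a rough sketch of the struct-of-homogeneous-maps idea with a templated accessor in the spirit of get<int>("scale") (the bucket helper and all names are mine, not part of the question's API):

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Values {
    std::map<std::string, std::vector<double>> doubles;
    std::map<std::string, std::vector<int>>    ints;
};

// select the right map by type, so get<T>(vals, "name") replaces get_int("name")
template <typename T> std::map<std::string, std::vector<T>>& bucket(Values&);
template <> std::map<std::string, std::vector<double>>& bucket<double>(Values& v) { return v.doubles; }
template <> std::map<std::string, std::vector<int>>&    bucket<int>(Values& v)    { return v.ints; }

template <typename T>
std::vector<T>& get(Values& v, const std::string& name) {
    return bucket<T>(v).at(name);   // at() throws on a missing name
}

int main() {
    Values vals;
    vals.doubles["x"] = {0.2, 0.4};
    vals.ints["scale"] = {1, 2};
    std::cout << get<double>(vals, "x")[0] << "\n";   // 0.2
    std::cout << get<int>(vals, "scale")[1] << "\n";  // 2
}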

c++ design patterns for chaining together transformations of streams of objects

I'm working on a multithreaded library which monitors network traffic from WinPcap and transforms the packets into several different types of data structures for consumption by various applications.
For each type of output, there will be several transformations required; each transformation could be described as taking 0-N objects of type X and producing 0-N objects of type Y, which will then be consumed by the next step in the process.
It's important to note that in the transformation of X's to Y's, if we currently have only 5 X's (as an example), that may or may not be enough to create a Y, or it might be enough to create many Y's, depending on the transformation and the data received.
To be consistent, we would obviously like to use a standard pattern for each transformation object. I'm hoping that someone could point out a commonly used pattern for something like this, ideally one that relies on std (or boost) libraries.
Additionally, we have been discussing the possibility of using chains of inheritance to link the different layers together, i.e.:
class ProcessXtoY : public ProcessWtoX
{
    void processData(iterator<X> begin, iterator<X> end)
    {
        /* create Y's, send output to the next stage */
    }
    virtual void processData(iterator<Y> begin, iterator<Y> end) = 0;
};

class ProcessYtoZ : public ProcessXtoY
{
    void processData(iterator<Y> begin, iterator<Y> end)
    {
        /* ... */
    }
};
Can anyone suggest some examples of commonly used patterns for this type of project?
Using inheritance to link the transformations together is not what inheritance should be used for, and it is pretty inflexible when it comes to adding new transformations - for example, if you ever need new combinations of transformations (say, W directly to Y).
Instead, have you considered creating transformation class(es) that describe each transformation algorithm, and then using std::transform to chain the transformations together?
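For example, a small sketch of chaining two 1:1 transformations with std::transform (the functor types are illustrative; note that std::transform is one-in/one-out, so by itself it doesn't cover the 0-N cases described in the question):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// two 1:1 transformation functors (purely illustrative types)
struct XtoY {
    std::string operator()(int x) const { return std::to_string(x); }
};
struct YtoZ {
    std::size_t operator()(const std::string& y) const { return y.size(); }
};

int main() {
    std::vector<int> xs = {1, 22, 333};

    std::vector<std::string> ys;
    std::transform(xs.begin(), xs.end(), std::back_inserter(ys), XtoY{});

    std::vector<std::size_t> zs;
    std::transform(ys.begin(), ys.end(), std::back_inserter(zs), YtoZ{});
    // zs == {1, 2, 3}
}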
The approach in your sample (I'd call it the "iterate" approach) is an obvious strawman - you're pushing an infinite stream of packets, so there is no end() to them.
I think you can go with either a pull or a push approach. For pull, something along the lines of Java's hasNext()/next(). Unfortunately, it's hard to branch, and the original data source needs to queue because we don't know when consumers will pick up the packets.
For push approach you can use register(listener) and listener.process() combination. This one is easily branched and the buffering (in case process() at the network packet layer takes too long) can be done for you by the system, or you can introduce explicit queues at any level.
So overall, I'd recommend event listeners here.
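A minimal sketch of that register(listener)/process() push style, with buffering inside the stage so that 0-N inputs can yield 0-N outputs (the types and the pairing rule are made up for illustration):

#include <functional>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// One stage: buffers incoming X's (ints here) and pushes Y's (strings)
// to registered listeners. Every two ints produce one "a+b" string, so
// the 0-N in / 0-N out behaviour falls out of the buffering in process().
class IntPairToString {
public:
    void register_listener(std::function<void(const std::string&)> l) {
        listeners_.push_back(std::move(l));
    }
    void process(int x) {
        buffer_.push_back(x);
        if (buffer_.size() == 2) {                // enough X's for one Y
            std::string y = std::to_string(buffer_[0]) + "+" +
                            std::to_string(buffer_[1]);
            for (auto& l : listeners_) l(y);      // push downstream
            buffer_.clear();
        }
    }
private:
    std::vector<int> buffer_;
    std::vector<std::function<void(const std::string&)>> listeners_;
};

int main() {
    IntPairToString stage;
    stage.register_listener([](const std::string& y) { std::cout << y << "\n"; });
    std::vector<int> packets = {1, 2, 3, 4, 5};
    for (int x : packets) stage.process(x);       // prints "1+2" and "3+4"
}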
I have a few suggestions. You could use a variation of the Decorator Pattern, modified so you chain together different object types. Then if you have different implementations of the same transform, it is easy to swap one out programmatically, or at runtime. There might already be a named pattern somewhere that is this variant, but you should be able to derive it from the basics of the Decorator Pattern.
If you want a multi-threaded solution, I would recommend chaining together your transformations through producer/consumer queues (see this). That way you could have several different consumers (transforms) work in parallel and place the completed transforms onto the next producer/consumer in the line. Of course, this really only works if the ordering of your transforms doesn't matter for the rest of your program, or if you have some way to keep track of it and reorder the final objects when they are needed again. Again, you can easily swap out your transforms programmatically or at runtime if you have different implementations of them.
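For illustration, a minimal sketch of one transformation stage wired through blocking producer/consumer queues (assumes C++11 threads; all names are mine):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

template <typename T>
class BlockingQueue {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T pop() {                       // blocks until an item is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

int main() {
    BlockingQueue<int> xs;          // X's from the previous stage
    BlockingQueue<int> ys;          // Y's for the next stage

    // consumer of X's / producer of Y's: one transformation stage
    std::thread stage([&] {
        for (int i = 0; i < 3; ++i) ys.push(xs.pop() * 10);
    });

    for (int i = 1; i <= 3; ++i) xs.push(i);                    // upstream producer
    for (int i = 0; i < 3; ++i) std::cout << ys.pop() << "\n";  // 10 20 30
    stage.join();
}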
If you need something more generic and configurable you could use the Builder Pattern to encapsulate the chaining process and allow for complete runtime configuration of the builder and have finer control of swapping out the chaining process. Of course you would use other patterns within the builder to implement the transformation chains.

What is the best data structure to store FIX messages?

What's the best way to store the following message into a data structure for easy access?
"A=abc,B=156,F=3,G=1,H=10,G=2,H=20,G=3,H=30,X=23.50,Y=xyz"
The above consists of key/value pairs of the following:
A=abc
B=156
F=3
G=1
H=10
G=2
H=20
G=3
H=30
X=23.50
Y=xyz
The tricky part is the keys F, G and H. F indicates the number of items in a group, where each item consists of G and H.
For example if F=3, there are three items in this group:
Item 1: G=1, H=10
Item 2: G=2, H=20
Item 3: G=3, H=30
In the above example, each item consists of two key/value pairs: G and H. I would like the data structure to be flexible, such that it can handle an item gaining more key/value pairs. As much as possible, I would like to maintain the order in which the pairs appear in the string.
UPDATE: I would like to store the key/value pairs as strings even though the value often appears as float or other data type, like a map.
May not be what you're looking for, but I'd simply recommend using QuickFIX (quickfixengine.org), which is a very high quality C++ FIX library. It has the type "FIX::Message" which does everything you're looking for, I believe.
I work with FIX a lot in Python and Perl, and I tend to use a dictionary or hash. Your keys should be unique within the message. For C++, you could look at std::map or the non-standard hash_map extension.
If you have a subset of FIX messages you have to support (most exchanges usually use 10-20 types), you can roll your own classes to parse messages into. If you're trying to be more generic, I would suggest creating something like a FIXChunk class. The entirety of the message could be stored in this class, organized into keys and their values, as well as lists of repeating groups. Each of the repeating groups would itself be a FIXChunk.
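A rough sketch of what such a FIXChunk might hold (the class name comes from this answer; the layout and the hand-built example are my own illustration - note that a std::vector of the enclosing type is formally supported from C++17 on):

#include <string>
#include <utility>
#include <vector>

struct FIXChunk {
    // key/value pairs in insertion order, values kept as strings
    std::vector<std::pair<std::string, std::string>> fields;
    // each repeating-group item is itself a chunk
    std::vector<FIXChunk> groups;
};

int main() {
    // hand-built representation of "A=abc,B=156,F=3,G=1,H=10,...,X=23.50,Y=xyz"
    FIXChunk msg;
    msg.fields = {{"A", "abc"}, {"B", "156"}, {"F", "3"},
                  {"X", "23.50"}, {"Y", "xyz"}};
    msg.groups.push_back({{{"G", "1"}, {"H", "10"}}, {}});
    msg.groups.push_back({{{"G", "2"}, {"H", "20"}}, {}});
    msg.groups.push_back({{{"G", "3"}, {"H", "30"}}, {}});
}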
A simple solution, but you could use a std::multimap<std::string, std::string> to store the data. That allows you to have multiple entries with the same key.
In my experience, FIX messages are usually stored either in their original form (as a stream of bytes) or as a complex data structure providing a full API that can handle their intricacies. After all, a FIX message can sometimes represent a tree of data.
The problem with the latter solution is that the transformation is expensive in terms of computation cost in high-speed trading systems. If you are building a trading system, you may prefer to lazily parse only the parts of the FIX message that you need, which is admittedly easier said than done.
I am not familiar with efficient open-source implementations; companies like the one I work for usually have proprietary implementations.