Testing collection equality with ordering - unit-testing

I'm writing a simple test which checks a method that returns some interface beneath Collection. I'm trying to abstract away the internal representation of this collection as much as possible, so that the test will pass in both cases: when the method returns a List and when it returns a Set.
The Set is supposed to be ordered (a LinkedHashSet or a LinkedHashMap-backed Set), so I've got to test the order too. So generally I'd like to write a test like this:
assertThat(returnedList, containsOrdered("t1", "t2", "t3"));
which will fail iff the two collections aren't "the same" (i.e. don't hold the same values in the same order).
I've found the Hamcrest library to be useful in this case, however I'm stuck in its documentation. Any help would be appreciated, however I'd like to avoid writing a CollectionTestUtil or my own Hamcrest Matcher if possible.

You're nearly there.
assertThat(returnedList, contains("t1", "t2", "t3"))
will do it. Compare with containsInAnyOrder.
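For illustration, a minimal sketch (assuming JUnit 4 with Hamcrest on the classpath; the LinkedHashSet just stands in for whatever your method actually returns):
import static org.hamcrest.Matchers.contains;
import static org.hamcrest.Matchers.containsInAnyOrder;
import static org.junit.Assert.assertThat;

import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import org.junit.Test;

public class OrderingTest {
    @Test
    public void returnsItemsInInsertionOrder() {
        // Stand-in for the method under test; works equally for a List
        Collection<String> returned =
            new LinkedHashSet<String>(Arrays.asList("t1", "t2", "t3"));

        // Passes only if the values AND their iteration order match
        assertThat(returned, contains("t1", "t2", "t3"));

        // Passes as long as the same values are present, in any order
        assertThat(returned, containsInAnyOrder("t3", "t1", "t2"));
    }
}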

JUnit has org.junit.Assert, which contains multiple assertArrayEquals implementations for different types, so you could do something like:
Collection<String> returnedList = new ArrayList<String>(); // Replace with a call to whatever returns the ordered collection
Assert.assertArrayEquals(new Object[]{"t1", "t2", "t3"}, returnedList.toArray());


Why Array.to_s is not recursive?

Array#to_s uses inspect on its contents instead of calling to_s recursively.
This code
class Some
def to_s; "some" end
end
puts [Some.new].to_s
would produce [#<Some:0x10078ce80>] instead of [some].
I'm wondering if there could be any bad consequences if I override Array#to_s to use to_s recursively. And why doesn't it work this way by default?
When you stringify a collection, you need every collection item to be self-contained in order to distinguish one item from another. #inspect takes care of that because it's supposed to stringify the value in an unambiguous way.
For example, if Array#inspect used #to_s on the items, the result of ["a", "b,c", 'd'].to_s would be "[a,b,c,d]" and you can't tell how many items are even in the collection or what types they have.
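A quick sketch of the practical upshot (assuming Ruby 1.9+, where Array#to_s is an alias for Array#inspect): override #inspect rather than #to_s to get the output the question wanted.
class Some
  def inspect; "some" end  # Array#to_s calls inspect on each element
end
puts [Some.new].to_s  # => [some]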

String concatenation from within a single list

Scala is new to me so I'm not sure the best way to go about this.
I need to simply take the strings within a single list and join them.
So, concat(List("a","b","c")) returns "abc".
Should I first find out how many strings there are in the list, so that I can loop through and join them all? I feel like that needs to be done first, so that you can use the list just like an array and do list[1] append list[2] append list[3], etc.
Edit:
Here's my idea, of course with compile errors:
def concat(l: List[String]): String = {
  var len = l.length
  var i = 0
  while (i < len) {
    val result = result :: l(i) + " "
  }
  result
}
How about this, in the REPL:
List("a","b","c") mkString("")
or in script file
List("a","b","c").mkString("")
Some options to explore for you:
1. imperative: a for-loop; use methods on List to determine the loop length, or for-each over the List items
2. classical functional: a recursive function, processing one element at a time
3. higher-order functions: look at fold.
Given the basic level of the problem, I think you're looking at learning some fundamentals in programming. If the language of choice is Scala, probably the focus is on functional programming, so I'd put effort on solving #2, then solve #1. #3 for extra credits.
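A rough sketch of #2, offered as one possible shape rather than the canonical answer (assuming an empty list concatenates to the empty string):
def concat(l: List[String]): String = l match {
  case Nil          => ""                  // base case: nothing left
  case head :: tail => head + concat(tail) // head, then the rest
}

concat(List("a", "b", "c")) // "abc"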
This exercise is designed to encourage you to think about the problem from a functional perspective. You have a set of data over which you wish to move, performing a set of identical operations. You've already identified the imperative, looping construct (for). Simple enough. Now, how would you build that into a functional construct, not relying on "stateful" looping?
In functional programming, fold ... is a family of higher-order
functions that iterate an arbitrary function over a data structure in
some order and build up a return value.
http://en.wikipedia.org/wiki/Fold_%28higher-order_function%29
That sounds like something you could use.
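Indeed. A minimal sketch with foldLeft (same concat signature as in the question):
def concat(l: List[String]): String =
  l.foldLeft("")((acc, s) => acc + s) // (("" + "a") + "b") + "c" = "abc"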
As string concatenation is associative (to be exact, it forms a monoid with the empty String as the neutral element), the "direction" of the fold doesn't matter (at least if you're not bothered by performance).
Speaking of performance: In real life, it would be a good idea to use a StringBuilder for the intermediate steps, but it's up to you if you want to use it.
A bit longer than mkString, but more efficient:
s.foldLeft(new StringBuilder())(_ append _).toString()
I'm just assuming here that you are not only new to Scala, but also new to programming in general. I'm not saying SO isn't made for newbies, but I'm sure there are many other places better suited to your needs. For example, books...
I'm also assuming that your problem doesn't have to be solved in a functional, imperative or some other way. It just has to be solved as a homework assignment.
So here is a list of things you should consider / ask yourself:
If you want to concat all elements of the list do you really need to know how many there are?
If you think you do, fine, but after having solved this problem using this approach try to fiddle around with your solution a little bit to find out if there is another way.
Appending the elements to a resulting list is a thought in the right direction, but think about this: in addition to being object-oriented, Scala is also a full-blown functional language. You might not know what this means, but all you need to know for now is this: it is pretty darn good with things like lists (LISP is the best-known functional language, and it stands for LISt Processing, which has to be an indication of some kind, don't you think? ;)). So maybe there is some magical (maybe even Scala-idiomatic) way to accomplish such a concatenation without defining the resulting list yourself.

Design for customizable string filter

Suppose I have tons of filenames in my_dir/my_subdir, formatted in some way:
data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00
data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00
data11_7TeV.00178109.physics_Egamma.merge.D2AOD_DIPHO.f351_m765_p539_p540_tid312017_00
For example, data11_7TeV is the data_type, 00179691 the run number, and NTUP_PHOTON the data format.
I want to write an interface to do something like this:
dataset = DataManager("my_dir/my_subdir").filter_type("data11_7TeV").filter_run("> 00179691").filter_tag("m = 796");
// don't do the filtering, be lazy
cout << dataset.count(); // count is an action, do the filtering
vector<string> dataset_list = dataset.get_list(); // don't repeat the filtering
dataset.save_filter("file.txt", "ALIAS"); // save the filter (not the filenames), for example save the regex
dataset2 = DataManagerAlias("file.txt", "ALIAS"); // get the saved filter
cout << dataset2.filter_tag("p = 123").count();
I want lazy behaviour, for example no real filtering has to be done before any action like count or get_list. I don't want to redo the filtering if it is already done.
I'm just learning something about design pattern, and I think I can use:
an abstract base class AbstractFilter that implements the filter* methods
a factory to decide, from the called method, which decorator to use
every time I call a filter* method I return a decorated class, for example:
AbstractFilter::filter_run(string arg) {
    decorator = factory.get_decorator_run(arg); // if arg is "> 00179691" returns FilterRunGreater(00179691)
    return decorator(this);
}
a proxy that builds a regex to filter the filenames, but doesn't do the filtering
I'm also learning jQuery and I'm using a similar chaining mechanism.
Can someone give me some hints? Is there some place where a design like this is explained? The design must be very flexible, in particular to handle new format in the filenames.
I believe you're over-complicating the design-pattern aspect and glossing over the underlying matching/indexing issues. Getting the full directory listing from disk can be expected to be orders of magnitude more expensive than the in-RAM filtering of filenames it returns, and the former needs to have completed before you can do a count() or get_list() on any dataset (though you could come up with some lazier iterator operations over the dataset).
As presented, the real functional challenge could be in indexing the filenames so you can repeatedly find the matches quickly. But, even that's unlikely as you presumably proceed from getting the dataset of filenames to actually opening those files, which is again orders of magnitude slower. So, optimisation of the indexing may not make any appreciable impact to your overall program's performance.
But, let's say you read all the matching directory entries into an array A.
Now, for filtering, it seems your requirements can generally be met using std::multimap's find(), lower_bound() and upper_bound(). The most general way to approach it is to have separate multimaps for data type, run number, data format, p value, m value, tid etc., each mapping to a list of indices in A. You can then use existing STL algorithms to find the indices that are common to the results of your individual filters.
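As a rough sketch of that idea (names are mine, not a fixed design; A holds the raw filenames and run_index is one of the per-field multimaps):
#include <map>
#include <string>
#include <vector>

std::vector<std::string> A;                   // the raw directory listing
std::multimap<std::string, size_t> run_index; // run number -> index into A

// Indices of all entries whose run number is strictly greater than `run`
// (zero-padded run numbers compare correctly as strings).
std::vector<size_t> runs_greater_than(const std::string& run) {
    std::vector<size_t> hits;
    for (auto it = run_index.upper_bound(run); it != run_index.end(); ++it)
        hits.push_back(it->second);
    return hits;
}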
There are a lot of optimisations possible if you happen to have unstated insights / restrictions regarding your data and filtering needs (which is very likely). For example:
if you know a particular filter will always be used, and immediately cuts the potential matches down to a manageable number (e.g. < ~100), then you could use it first and resort to brute force searches for subsequent filtering.
Another possibility is to extract properties of individual filenames into a structure: std::string data_type; std::vector<int> p; etc., then write an expression evaluator supporting predicates like "p includes 924 and data_type == 'XYZ'", though by itself that lends itself to brute-force comparisons rather than faster index-based matching.
I know you said you don't want to use external libraries, but an in-memory database and SQL-like query ability may save you a lot of grief if your needs really are at the more elaborate end of the spectrum.
I would use a strategy pattern. Your DataManager constructs a DataSet type, and the DataSet has a FilteringPolicy assigned. The default can be a NullFilteringPolicy, which means no filters. If the DataSet member function filter_type(string t) is called, it swaps out the filter policy class with a new one, which can be factory-constructed via the filter_type param. Methods like filter_run() can be used to add filtering conditions onto the FilterPolicy. In the NullFilterPolicy case they're just no-ops. This seems straightforward to me; I hope this helps.
EDIT:
To address the method chaining, you simply need to return *this, i.e. return a reference to the DataSet class. This means you can chain DataSet method calls together. It's what the C++ iostream libraries do when you implement operator>> or operator<<.
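A bare-bones sketch of the chaining part (hypothetical names; the policy internals and result caching are elided):
#include <cstddef>
#include <string>
#include <vector>

class DataSet {
public:
    // Each modifier just records a condition and returns *this, so calls chain.
    DataSet& filter_type(const std::string& t) { conditions_.push_back("type:" + t); return *this; }
    DataSet& filter_run(const std::string& r)  { conditions_.push_back("run:" + r);  return *this; }

    // count() is an action: only here would the recorded filters actually run.
    std::size_t count() const { /* apply conditions_, cache the result */ return 0; }

private:
    std::vector<std::string> conditions_; // stand-in for a real FilterPolicy
};

// usage: DataSet().filter_type("data11_7TeV").filter_run("> 00179691").count();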
First of all, I think that your design is pretty smart and lends itself well to the kind of behavior you are trying to model.
Anyway, my understanding is that you are trying to build a sort of "Domain Specific Language", whereby you can chain "verbs" (the various filtering methods) representing actions on, or connecting, "entities" (where the variability is represented by the different naming formats that could exist, although you do not say anything about this).
In this respect, a very interesting discussion is found in Martin Fowler's book "Domain Specific Languages". Just to give you a taste of what it is about, here you can find an interesting discussion of the "Method Chaining" pattern, defined as:
“Make modifier methods return the host object so that multiple modifiers can be invoked in a single expression.”
As you can see, this pattern describes the very chaining mechanism you are positing in your design.
Here you have a list of all the patterns that were found interesting in defining such DSLs. Again, you will easily find there several specialized patterns that you are also implying in your design, or describing by way of more generic patterns (like the decorator). A few of them are: Regex Table Lexer, Method Chaining, Expression Builder, etc. And many more that could help you further specify your design.
All in all, I could add my two cents by saying that I see a place for a "command processor" pattern in your specification, but I am pretty confident that by deploying the powerful abstractions that Fowler proposes you will be able to come up with a much more specific and precise design, covering aspects of the problem that right now are simply hidden by the "generality" of the GoF pattern set.
It is true that this could be "overkill" for a problem like the one you are describing, but as an exercise in pattern oriented design it can be very insightful.
I'd suggest starting with the Boost iterator library - e.g. the filter iterator.
(And, of course, Boost includes a very nice regex library.)
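For instance, a sketch with filter_iterator (assuming C++11 and Boost headers are available; the predicate and sample names are just examples):
#include <boost/iterator/filter_iterator.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> names{"a.physics_Egamma.b", "c.physics_Muons.d"};
    auto is_egamma = [](const std::string& s) {
        return s.find("physics_Egamma") != std::string::npos;
    };
    // Lazily skips non-matching entries during iteration; nothing is copied.
    auto first = boost::make_filter_iterator(is_egamma, names.begin(), names.end());
    auto last  = boost::make_filter_iterator(is_egamma, names.end(), names.end());
    for (; first != last; ++first)
        std::cout << *first << '\n';
}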

Unit Testing a large method

Following Test-Driven Development, that is.
I've recently implemented an algorithm (A*) that required a clean interface. By clean, all I want is a couple of properties and a single search method.
What I've found hard is testing the search method. It contains around five steps, but I'm essentially forced to code this method in one big go, which makes things hard.
Is there any advice for this?
Edit
I'm using C#. No, I don't have the code at hand at the moment. My problem lies in the fact that a test only passes after implementing the whole search method, rather than after a step in the algorithm. I naturally refactored the code afterwards, but it was the implementing I found hard.
If your steps are large enough (or are meaningful in their own right) you should consider delegating them to other smaller classes and test the interaction between your class and them. For example if you have a parsing step followed by a sorting step followed by a searching step it could be meaningful to have a parser class, a sorter class, etc. You would then use TDD on each of those.
No idea which language you're using, but if you're in the .NET world you could make these classes internal and then expose them to your test assembly with InternalsVisibleTo, which would keep them hidden.
If the steps are small AND meaningless on their own then tvanfosson's suggestions are the way to go.
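In C#, that decomposition might look roughly like this (all names are hypothetical stand-ins; each small class gets its own TDD cycle):
// Minimal stand-in node type for the sketch.
public class Node { }

// Each step is a small, separately testable collaborator.
internal class Parser   { public Node[] Parse(string input) { /* ... */ return new Node[0]; } }
internal class Sorter   { public Node[] Sort(Node[] nodes)  { /* ... */ return nodes; } }
internal class Searcher { public Node Find(Node[] nodes)    { /* ... */ return nodes[0]; } }

public class PathSearch
{
    private readonly Parser parser = new Parser();
    private readonly Sorter sorter = new Sorter();
    private readonly Searcher searcher = new Searcher();

    // The clean public interface just wires the tested steps together.
    public Node Search(string input)
    {
        return searcher.Find(sorter.Sort(parser.Parse(input)));
    }
}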
Refactor your big method into smaller, private methods. Use reflection, or another mechanism available in your language, to test the smaller methods independently. Worst case -- if you don't have access to reflection or friends, make the methods protected and have your test class inherit from the main class so it can have access to them.
Update: I should also clarify that simply refactoring to private methods doesn't necessarily imply that you need to create tests specific to those methods. If you are thoroughly testing the public methods that rely on the private methods, you may not need to test the private methods directly. This is probably the general case. There are times when it does make sense to test private methods directly (say when it simplifies or reduces the number of test cases needed for public methods), but I wouldn't consider it a requirement to create tests simply because you refactor to a private method implementation.
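For what it's worth, invoking a private method via reflection in C# looks something like this (a sketch; PathFinder and Step1 are hypothetical names):
using System.Reflection;

public class PathFinder
{
    private int Step1(int x) { return x + 1; }
}

public class PathFinderTests
{
    public void Step1AddsOne()
    {
        var target = new PathFinder();
        MethodInfo step = typeof(PathFinder).GetMethod(
            "Step1", BindingFlags.NonPublic | BindingFlags.Instance);
        // Invoke boxes the result as object, so cast back to the real type.
        int result = (int)step.Invoke(target, new object[] { 42 });
        // assert result == 43 with your test framework of choice
    }
}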
An important thing to remember when employing TDD is that tests don't need to live forever.
In your example you know you're going to provide a clean interface, and much of the operation will be delegated to private methods. I think it's the assumption that these methods have to be created as private, and stay that way, that causes the most bother. Also the assumption that the tests must stay around once you've developed them.
My advice for this scenario would be to:
Sketch the skeleton of the large method in terms of new methods, so you have a series of steps which make sense to test individually.
Make these new methods public by default, and begin to develop them using TDD.
Then once you have all the parts tested, write another test which exercises the entire method, ensuring that if any of the smaller, delegated parts is broken, this new test will fail.
At this point the big method works and is covered by tests, so you can now begin to refactor. I'd recommend, as FinnNk does, trying to extract the smaller methods into classes of their own. If this doesn't make sense, or if it would take too much time, I'd recommend something others may not agree with: change the small methods into private ones and delete the tests for them.
The result is that you have used TDD to create the entire method, the method is still covered by the tests, and the API is exactly the one you want to present.
A lot of confusion comes from the idea that once you've written a test, you always have to keep it, and that is not necessarily the case. However, if you are sentimental, there are ways unit tests for private methods can be achieved in Java, perhaps there are similar constructs in C#.
I'm assuming that A* means the search algorithm (e.g. http://en.wikipedia.org/wiki/A*_search_algorithm). If so, I understand your problem, as we have similar requirements. Here's the WP algorithm, and I'll comment below:
Pseudocode
function A*(start, goal)
    closedset := the empty set                  % The set of nodes already evaluated.
    openset := set containing the initial node  % The set of tentative nodes to be evaluated.
    g_score[start] := 0                         % Distance from start along optimal path.
    h_score[start] := heuristic_estimate_of_distance(start, goal)
    f_score[start] := h_score[start]            % Estimated total distance from start to goal through y.
    while openset is not empty
        x := the node in openset having the lowest f_score[] value
        if x = goal
            return reconstruct_path(came_from, goal)
        remove x from openset
        add x to closedset
        foreach y in neighbor_nodes(x)
            if y in closedset
                continue
            tentative_g_score := g_score[x] + dist_between(x, y)
            if y not in openset
                add y to openset
                tentative_is_better := true
            elseif tentative_g_score < g_score[y]
                tentative_is_better := true
            else
                tentative_is_better := false
            if tentative_is_better = true
                came_from[y] := x
                g_score[y] := tentative_g_score
                h_score[y] := heuristic_estimate_of_distance(y, goal)
                f_score[y] := g_score[y] + h_score[y]
    return failure

function reconstruct_path(came_from, current_node)
    if came_from[current_node] is set
        p = reconstruct_path(came_from, came_from[current_node])
        return (p + current_node)
    else
        return the empty path
The closed set can be omitted (yielding a tree search algorithm) if a solution is guaranteed to exist, or if the algorithm is adapted so that new nodes are added to the open set only if they have a lower f value than at any previous iteration.
First, and I'm not being frivolous, it depends whether you understand the algorithm - it sounds as if you do. It would also be possible to transcribe the algorithm above (hoping it worked) and give it a number of tests. That's what I would do, as I suspect that the authors of WP are better than me! The large-scale tests would exercise edge cases such as no node, one node, two nodes + no edge, etc. If they all passed I would sleep happy. But if they failed there would be no choice but to understand the algorithm.
If so, I think you have to construct tests for the data structures. These are (at least) the set, distance, score, etc. You have to create these objects and test them. What is the expected distance for case 1, 2, 3...? Write tests. What is the effect of adding A to set Z? That needs a test. For this algorithm you need to test heuristic_estimate_of_distance and so on. It's a lot of work.
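Concretely, such tests might look like this (NUnit-style; Node, Heuristic and OpenSet are hypothetical stand-ins for your own structures, sketched inline so the example runs):
using System.Collections.Generic;
using System.Linq;
using NUnit.Framework;

public struct Node { public int X, Y; public Node(int x, int y) { X = x; Y = y; } }

public static class Heuristic
{
    // Manhattan distance as the admissible estimate.
    public static int EstimateDistance(Node a, Node b)
    {
        return System.Math.Abs(a.X - b.X) + System.Math.Abs(a.Y - b.Y);
    }
}

public class OpenSet
{
    private readonly List<KeyValuePair<Node, int>> items = new List<KeyValuePair<Node, int>>();
    public void Add(Node n, int fScore) { items.Add(new KeyValuePair<Node, int>(n, fScore)); }
    public Node PopLowest()
    {
        var best = items.OrderBy(kv => kv.Value).First(); // lowest f_score wins
        items.Remove(best);
        return best.Key;
    }
}

[TestFixture]
public class AStarPartsTests
{
    [Test]
    public void HeuristicIsZeroAtGoal()
    {
        var goal = new Node(3, 4);
        Assert.AreEqual(0, Heuristic.EstimateDistance(goal, goal));
    }

    [Test]
    public void OpenSetReturnsLowestFScoreFirst()
    {
        var open = new OpenSet();
        open.Add(new Node(0, 0), 10);
        open.Add(new Node(1, 1), 3);
        Assert.AreEqual(new Node(1, 1), open.PopLowest());
    }
}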
One approach may be to find an implementation in another language and interrogate it to find the values in the data structures. Of course if you are modifying the algorithm you are on your own!
There's one thing even worse than this - numerical algorithms. Diagonalizing matrices - do we actually get the right answers? I worked with one scientist writing 3rd-derivative matrices - it would terrify me...
You might consider Texttest (http://texttest.carmen.se/) as a way of testing "under the interface".
It allows you to check behavior by examining logged data, rather than purely black-box-style testing on arguments and method results.
DISCLAIMER: I've heard a presentation on Texttest, and looked at the docs, but haven't yet had the time to try it out in a serious application.

Why use tuples instead of objects?

The codebase where I work has an object called Pair<A, B>, where A and B are the types of the first and second values in the Pair. I find this object offensive, because it gets used instead of an object with clearly named members. So I find this:
List<Pair<Integer, Integer>> productIds = blah();
// snip many lines and method calls
void doSomething(Pair<Integer, Integer> id) {
    Integer productId = id.first();
    Integer quantity = id.second();
}
Instead of
class ProductsOrdered {
    int productId;
    int quantityOrdered;
    // accessor methods, etc
}
List<ProductsOrdered> productsOrdered = blah();
Many other uses of the Pair in the codebase are similarly bad-smelling.
I Googled tuples and they seem to be often misunderstood or used in dubious ways. Is there a convincing argument for or against their use? I can appreciate not wanting to create huge class hierarchies but are there realistic codebases where the class hierarchy would explode if tuples weren't used?
First of all, a tuple is quick and easy: instead of writing a class every time you want to put 2 things together, there's a template that does it for you.
Second of all, they're generic. For example, in C++ the std::map uses an std::pair of key and value. Thus ANY pair can be used, instead of having to make some kind of wrapper class with accessor methods for every permutation of two types.
Finally, they're useful for returning multiple values. There's really no reason to make a class specifically for a function's multiple return values, and they shouldn't be treated as one object if they're unrelated.
To be fair, the code you pasted is a bad use of a pair.
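To illustrate the legitimate multiple-return-value case, a throwaway generic pair in Java (a minimal sketch; the codebase's real Pair presumably looks much like this):
public class Pair<A, B> {
    private final A first;
    private final B second;

    public Pair(A first, B second) { this.first = first; this.second = second; }
    public A first()  { return first; }
    public B second() { return second; }

    // A genuinely ad hoc grouping: two results from a single pass.
    static Pair<Integer, Integer> minMax(int[] xs) {
        int min = xs[0], max = xs[0];
        for (int x : xs) { min = Math.min(min, x); max = Math.max(max, x); }
        return new Pair<Integer, Integer>(min, max);
    }
}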
Tuples are used all the time in Python where they are integrated into the language and very useful (they allow multiple return values for starters).
Sometimes you really just need to pair things, and creating a real, honest-to-god class is overkill. On the other hand, using tuples when you should really be using a class is just as bad an idea as the reverse.
The code example has a few different smells:
Reinventing the wheel
There is already a tuple available in the framework: the KeyValuePair structure. This is used by the Dictionary class to store pairs, but you can use it anywhere it fits. (Not saying that it fits in this case...)
Making a square wheel
If you have a list of pairs it's better to use the KeyValuePair structure than a class with the same purpose, as it results in fewer memory allocations.
Hiding the intention
A class with properties clearly shows what the values mean, while a Pair<int,int> class doesn't tell you anything about what the values represent (only that they are probably related somehow). To make the code reasonably self-explanatory with a list like that, you would have to give the list a very descriptive name, like productIdAndQuantityPairs...
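A small illustration of the trade-off (C#; the product/quantity pairing mirrors the question):
using System.Collections.Generic;

class Demo
{
    static void Main()
    {
        // KeyValuePair is a struct, so the list stores the pairs inline
        // (the memory-allocation point above)...
        var orders = new List<KeyValuePair<int, int>>();
        orders.Add(new KeyValuePair<int, int>(42, 3));

        // ...but Key/Value say nothing about the domain; the reader must just know.
        int productId = orders[0].Key;
        int quantity  = orders[0].Value;
    }
}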
For what it's worth, the code in the OP is a mess not because it uses tuples, but because the values in the tuple are too weakly typed. Compare the following:
List<Pair<Integer, Integer>> products_weak = blah1();
List<Pair<Product, Integer>> products_strong = blah2();
I'd get upset too if my dev team were passing around IDs rather than class instances, because an integer can represent anything.
With that being said, tuples are extremely useful when you use them right:
Tuples exist to group ad hoc values together. They are certainly better than creating excessive numbers of wrapper classes.
Useful alternative to out/ref parameters when you need to return more than one value from a function.
However, tuples in C# make my eyes water. Many languages like OCaml, Python, Haskell, F#, and so on have a special, concise syntax for defining tuples. For example, in F#, the Map module defines a constructor as follows:
val of_list : ('key * 'a) list -> Map<'key,'a>
I can create an instance of a map using:
(* val values : (int * string) list *)
let values =
    [1, "US";
     2, "Canada";
     3, "UK";
     4, "Australia";
     5, "Slovenia"]

(* val dict : Map<int, string> *)
let dict = Map.of_list values
The equivalent code in C# is ridiculous:
var values = new Tuple<int, string>[] {
    new Tuple<int, string>(1, "US"),
    new Tuple<int, string>(2, "Canada"),
    new Tuple<int, string>(3, "UK"),
    new Tuple<int, string>(4, "Australia"),
    new Tuple<int, string>(5, "Slovenia")
};
// Dictionary has no constructor taking a Tuple array; LINQ's ToDictionary does the job
var dict = values.ToDictionary(t => t.Item1, t => t.Item2);
I don't believe there is anything wrong with tuples in principle, but C#'s syntax is too cumbersome to get the most use out of them.
It's code reuse. Rather than writing Yet Another Class With Exactly The Same Structure As The Last 5 Tuple-Like Classes We Made, you make... a Tuple class, and use that whenever you need a tuple.
If the only significance of the class is "to store a pair of values", then I'd say using tuples is an obvious idea. I'd say it was a code smell (as much as I hate the term) if you started implementing multiple identical classes just so that you could rename the two members.
This is just prototype-quality code that likely was smashed together and has never been refactored. Not fixing it is just laziness.
The real use for tuples is for generic functionality that really doesn't care what the component parts are, but just operates at the tuple level.
Scala has tuple-valued types, all the way from 2-tuples (Pairs) up to 22-element tuples. See First Steps to Scala (step 9):
val pair = (99, "Luftballons")
println(pair._1)
println(pair._2)
Tuples are useful if you need to bundle together values for some relatively ad hoc purpose. For example, if you have a function that needs to return two fairly unrelated objects, rather than create a new class to hold the two objects, you return a Pair from the function.
I completely agree with other posters that tuples can be misused. If a tuple has any sort of semantics important to your application, you should use a proper class instead.
The obvious example is a coordinate pair (or triple). The labels are irrelevant; using X and Y (and Z) is just a convention. Making them uniform makes it clear that they can be treated in the same way.
Many things have already been mentioned, but I think one should also mention that there are some programming styles that differ from OOP, and for those, tuples are quite useful.
Functional programming languages like Haskell for example don't have classes at all.
If you're doing a Schwartzian transform to sort by a particular key (which is expensive to compute repeatedly), or something like that, a class seems to be a bit overkill:
val transformed = data map {x => (x.expensiveOperation, x)}          // compute each key once
val sortedTransformed = transformed sortWith {(x, y) => x._1 < y._1} // sort by the cached key
val sorted = sortedTransformed map {case (_, x) => x}                // drop the keys
Having a class DataAndKey or whatever seems a bit superfluous here.
Your example was not a good example of a tuple, though, I agree.