Testing important implementation details - unit-testing

I'm implementing a key -> value associative container. Internally it's a sorted binary tree, and I want to make sure it's balanced so that find operations are sure to be Olog(n). The problem is that this is an implementation detail that is entirely private to the class, and I can't readily measure it from outside.
The best I can think to do is to benchmark my find operations - if they operate in linear time it's probably because the tree is unbalanced - but that seems far too inexact, and I'd feel better if I had a more direct way to measure.
What design/testing patterns are out there that might be helpful in these sorts of situations?

You could extract the balanced tree to it's own class, and test that class. In that class the balanced-ness is a feature of the class and could expose something like depth which would let you inspect it and assert that the tree remains balanced.

You are correct in saying that you are testing an implementation detail. The problem here is that a bad implementation also produces the correct output, it just takes longer. This means that the only measurable unit is time.
What I would do is similar to what you propose: create a big collection of data and structure it in a way that a good implementation should be able to find what you're looking for in a matter of moments and a bad implementation has to go through your entire collection before finding it.
This could translate to having thousands of elements and searching for the element that's last in line. You could structure it in a way that a good implementation should find it at the top of the tree and thus find it very quickly while a bad implementation should find it somewhere at the bottom, thus taking time to find it.
Many frameworks have an option to specify a timeout so if you set this to a low enough value and you have plenty of data in your collection, you can weed out slow-running implementations like that.

Related

Compare two QAbstractItemModels

I'm trying to figure out an efficient algorithm that takes in two QAbstractItemModels (trees) (A,B) and computes the differences between them, such that I get a list of Items that are not present in A (but are in B - added), or items that have been modified / deleted.
The only current way I can think of is doing a Breadth search of A for every item item in B. But this doesn't seem very efficient. Any ideas are welcome.
Have you tried using magic?
Seriously though, this is a very broad question, especially if we consider the fact it is an QAbstractItemModels and not a QAbstractListModel. For a list it would be much simpler, but an abstract item model implements a tree structure so there are a lot of variables.
do you check for total item count
do you check for item count per level
do you check if item is contained in both models
if so, is it contained at the same level
if so, is it contained at the same index
is the item in its original state or has it been modified
You need to make all those considerations and come up with an efficient solution. And don't expect it will as simple as a "by the book algorithm". Good news, since you are dealing with isolated items, it will be easier than trying to do that for text, and in the case of the latter, you can't hope to get anywhere nearly as concise as with isolated items. I've had my fair share of absurdly mindless github diff results.
And just in case that's your actual goal, it will be much easier to achieve by tracking the history of the derived data set than doing a blind comparison. Tracking history is much easier if you want to establish what is added, what is deleted, what is moved and what is modified. Because it will consider the actual event flow rather than just the end result comparison. Especially if you don't have any persistent ID scheme implemented. Is there a way to tell if item X has been deleted or moved to a new level/index and modified and stuff like that.
Also, worry about efficiency only after you have empirically established a performance issue. Some algorithms may seem overly complex, but modern machines are overly fast, and unless you are running that in a tight loop you shouldn't really worry about it. In the end, it doesn't boil down to how complex it is, it boils down to whether it is fast enough or not.

Managing large spatial data set with attributes in C++

I have a data set with about 700 000 entries, and each entry is a set of 3D coordinates with attributes such as name, timestamp, ID, and so on.
Right now I'm just reading the coordinates and render them as points in OpenGL. However I want to associate each point with its corresponding attributes and I want to be able to sort and pick them during runtime based on their attributes. How would I go about to achieve this in an efficient manner?
I know I can put I can put the data in a struct and use stl sort for sorting, but is that a good design choice or is there a more efficient/elegant way of handling the problem?
The way I tend to look at these design choices is to first use one of the standard library containers (btw, if you need to "just" do lookup you don't necessarily have to sort, but you need a container that allows lookup), then check if this an "efficient enough" solution for the problem.
You can usually come up with a custom solution that is more efficient and maybe more elegant but you tend to run into two issues with that:
1) You end up having to implement some type of a container, which will cost you time both in implementation and debugging compared to a well understood and tested container that is already out there. Most of the time you're better off trying to solve the problem at hand rather than make it bigger by adding more code.
2) If someone else will have to maintain your code at some point, chances are they are familiar with standard library components both from a design and implementation perspective, but they won't be familiar with your custom container, thus increasing the learning curve.
If you consider each attribute of your point class as a component of a vector, then your selection process is a region query. Your example of a string attribute being equal to something means that the region is actually a line in your data space. However, there won't be any sorting made on other attributes within that selection, you will have to implement it by yourself, but it should be relatively straightforward for octrees, which partition data in ordered regions.
As advocated in another answer, try existing standard solutions first. If you can find an of the shelf implementation of one of these data structures:
R-tree
KD tree
BSP
Octree, or more likely, a n dimensional version of the quadtree or octree principle (I will use the term octree herein to denote the general data structure)
then go for it. These are the data structures I recommend for spatial data management.
You could also use an embedded RDBMS capable of working with spatial data (they usually implement R-tree for spatial indexing), but it may not be interesting if your dataset isn't dynamic.
If your dataset falls within the 10000 entries range, then by today standards it isn't that large, so using simpler structures should suffice. In that perimeter, I would go first for a simple std::vector, and use std::sort and std::find to filter the data in smaller set and sort it afterward.
I would probably try an ordered set or map on the most queried attribute in a second attempt, then do some benchmarks to pick the more performing solution.
For a more efficient one dimensional indexing algorithm (in essence, that`s what sets and maps are), you might want to try B-trees: there's C++ implementation available from google.
My third attempt would go toward an OpenCL solution (although if you are doing heavy OpenGL rendering, you might prefer doing the work on the CPU instead, but that depends on your framerate needs).
If your dataset is much larger, as it seems to be, then consider one of the more complex solutions I listed initially.
At any rate, without more details about your dataset and how you plan to use it, it will be difficult to provide a good solution, so the only real advice we can give is: try everthing you can and benchmark.
If you're dealing with point clouds, take a look at PCL, it could save you a lot of time and effort without having to dig into the intricacies of spatial indexing yourself. It also includes visualisation.

C/C++ graph interface for representation of partial order

In my code I use a class which represents a directed acyclic graph. I wrote the code myself, it wasn't hard. But later I realized my app has more requirements: the graph must be transitive-reduced, i.e. unique representation of a partial ordrer. Every time the user does drag-n-drop or cut/copy/paste on the visual GUI representation of the graph, it has to be validated and adapted to this requirement. Now things become more complicated. So I did plan how to perform all graph operations safely, etc., but before I really dive into the code, I'd like to know:
Is there a known C/C++ interface for partial orders? (Preferably C++)
I found many many libraries for graphs, but I already have my simple acyclic digraph code. I couldn't find anything which deals specifically with transitively-reduced graphs (I don't need an adjacency matrix, the data comes from the user so it would be inefficient here... It's a small graph for user data, not something for mathematical use)
I'm looking for an interface which automatically detects unnecessary connections and removes them, does tests to see if a node copy/move operation would be valid partial-order-wise, i.e. preserve the properties of a partial order, etc.
I would recommend adding a partial-order validation method. When an edit is being made, make a copy of the whole graph apply the edit to one copy, then validate it. If it passes, keep the modified copy. If it doesn't pass, revert to the saved copy.
Perhaps the validator could find all bottom nodes, for each one, build a multiset of its ancestors (or descendants if you call them that) and check for duplicate entries. I would revert to recursion for the search if you expect only small graphs.
As far as I know, usually programs have their own graph classes when used for non-mathematical purposes. This happens because graphs may be much more complicated than linear containers such as the STL containers (vector, list, etc.).
Since you don't have any special needs in the field of math or algorithms (a search algorithm in your case would be a simple loop, in most cases you don't need more than that, and certainly not in the case of (premature) optimization). If you do, you have boost::graph, but I suspect it would complicate things more than help you.
So I say, write a good graph/node class, and if it's good enough and written for general-purpose, we can all benefit from that. Nobody is answering the question because there's really no existing public code which matches your needs. Write good libre code once, and it can then be used everywhere. Good luck.
P.S your own search algorithm may be much faster than ones written for general-purpose graph libraries, e.g. boost::graph, because you can take an advantage of the known restrictions and rules of you specific graph, thus making seraches much faster. For example, in a transitively-reduced graph, if A is a parent of B, then A cannot also have b as a non-child decendant (e.g. grand-child), so you can optimize your search using this knowledge. The price you pay is doing lots of tests when changing the graph, but you gain a lot back because searching/scanning can become much faster.

Search structure with history (persistence)

I need a map-like data structure (in C++) for storing pairs (Key,T) with the following functionality:
You can insert new elements (Key,T) into the current structure
You can search for elements based on Key in the current structure
You can make a "snapshot" of the current version of the structure
You can switch to one of the versions of the structures which you took the snapshot of and continue all operations from there
Completely remove one of the versions
What I don't need
Element removal from the structure
Merging of different versions of the structure into one
Iteration over all (or some of) elements currently stored in the structure
In other words, you have some search structure that you can build up, but at any point you can jump in history, and expand the earlier/different version of the structure in a different way. Later on you may jump between those different versions.
In my project, Key and T are likely to be integers or pointer values, but not strings.
The primary objective is to reduce the time complexity; space consumption is secondary (but should be reasonable as well). To clarify, for me log(N)+log(S) (where N-number of elements, S-number of snapshots) would be enough, although faster is better :)
I have some rough idea how to implement it --- for example: being the structure a binary search tree, the insertion of a new element can clone the path from the root to the insertion location, while keeping the rest of the tree intact. Switching tree versions would be equivalent to picking a different version of the root node, for which some changes are simply not visible.
However, to make this custom tree efficient (e.g. self-balancing) it will require some additional effort and careful coding. Of course I can do it myself but perhaps there are already existing libraries to do exactly that?
Also, there is probably a proper name for this kind of data structure that I simply don't know, making my Google searches (or SO searches) total failures...
Thank you for your help!
I think what you are looking for is an immutable map. Functional (or functionally inspired) programming languages (such as Haskell or Scala) have immutable versions of most of the containers you'd find in the STL. Operations such as insertion/removal etc. then return a copy of the map (preserving the original) with the copy containing your requested modification. A lot of work has gone into designing the datastructures so that the copies are able to point to as much of the original datastructure as possible to reduce time and memory complexity of each operation.
You can find a lot more details in a book such as this one: http://www.amazon.co.uk/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504.
While searching for some persistent search trees libraries I stumbled on this:
http://cg.scs.carleton.ca/~dana/pbst/
While it does not have the exact same functionality as needed, it seems pretty close to it. I will investigate.
(posting here, as someone may find it useful as well)

Efficient Dictionary lookup

For my C++ application, there is a requirement to check if a word is a valid English dictionary word or not. What is the best way to do it. Is there freely available dictionary that I can make use of. I just need a collection of all possible words. How to make this lookup least expensive. Do I need to hash it.
Use either a std::set<std::string> or a std::unordered_set<std::string>. The latter is new in C++0x and may or may not be supported by your C++ Standard Library implementation; if it does not support it, it may include a hash_set of some kind: consult your documentation to find out.
Which of these (set, which uses a binary search tree, and unordered_set, which uses a hashtable) is more efficient depends on the number of elements you are storing in the container and how your Standard Library implementation implements them. Your best bet is to try both and see which performs better for your specific scenario.
Alternatively, if the list of words is fixed, you might consider using a sorted std::vector and using std::binary_search to find words in it.
With regards to the presence of a word list, it depends on the platform.
Under Linux, /usr/share/dict/words contains a list of English words
that might meet your needs. Otherwise, there are doubtlessly such lists
available on the network.
Given the size of such lists, the most rapid access will be to load it
into a hash table. std::unsorted_set, if you have it; otherwise, many
C++ compilers come with a hash_set, although different compilers have
a slightly different interface for it, and put it in different
namespaces. If that still has performance problems, it's possible to do
better if you know the number of entries in advance (so the table never
has to grow), and implement the hash table in an std::vector (or even a
C style array); handling collisions will be a bit more complicated,
however.
Another possibility would be a trie. This will almost certainly result
in the least number of basic operations in the lookup, and is fairly
simple to implement. Typical implementations will have very poor
locality, however, which could make it slower than some of the other
solutions in actual practice (or not—the only way to know is to
implement both and measure).
I actually did this a few months ago, or something close to this. You can probably find one online for free.
Like on this website: http://wordlist.sourceforge.net/
Just put it in a text file, and compare words with what is on the list. It should be order n with n being the number of words in the list. Do you need the time complexity faster?
Hope this helps.