I am the developer of some family tree software (written in C++ and Qt). I had no problems until one of my customers mailed me a bug report. The problem is that the customer has two children with their own daughter, and, as a result, he can't use my software because of errors.
Those errors are the result of my various assertions and invariants about the family graph being processed (for example, after walking a cycle, the program states that X can't be both father and grandfather of Y).
How can I resolve those errors without removing all data assertions?
It seems you (and/or your company) have a fundamental misunderstanding of what a family tree is supposed to be.
Let me clarify: I also work for a company that has a family tree product in its portfolio, and we have been struggling with similar problems.
The problem, in our case (and I assume in your case as well), comes from the GEDCOM format, which is extremely opinionated about what a family should be. This format contains some severe misconceptions about what a family tree really looks like.
GEDCOM has many issues, such as its inability to represent same-sex relations, incest, and so on, which in real life happen more often than you'd imagine (especially when going back to the 1700s and 1800s).
We have modeled our family tree around what happens in the real world: events (for example, births, weddings, engagements, unions, deaths, adoptions, and so on). We do not put any restrictions on these, except for the logically impossible ones (for example, one can't be one's own parent, and a relation needs two individuals).
This lack of validation gives us a simpler, more flexible solution that also better matches the real world.
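For illustration, here is a minimal sketch of such an event-based model in C++ (the type and field names are mine, not from any actual product):

    #include <string>
    #include <vector>

    // A person is just an identity; all family structure lives in events.
    struct Person {
        int id;
        std::string name;
    };

    enum class EventKind { Birth, Wedding, Engagement, Union, Death, Adoption };

    // An event links any number of participants in given roles.
    struct Participant {
        int personId;
        std::string role;   // e.g. "child", "parent", "spouse"
    };

    struct Event {
        EventKind kind;
        std::vector<Participant> participants;
    };

    // The only checks are the logically impossible ones.
    bool isValid(const Event& e) {
        if (e.participants.size() < 2)
            return false;                  // a relation needs two individuals
        for (const Participant& p : e.participants)
            for (const Participant& q : e.participants)
                if (p.personId == q.personId &&
                    p.role == "parent" && q.role == "child")
                    return false;          // one can't be one's own parent
        return true;
    }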
As for this specific case, I would suggest removing the assertions as they do not hold universally.
For the display issues that will arise, I would suggest drawing the same node as many times as needed, hinting at the duplication by highlighting all the copies when one of them is selected.
Relax your assertions.
Not by changing the rules, which are most likely very helpful to 99.9% of your customers in catching mistakes when entering their data.
Instead, change it from an error "can't add relationship" to a warning with an "add anyway".
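A minimal sketch of that idea in C++ (the names here are mine, purely illustrative):

    #include <functional>

    struct Person;  // your existing person type

    enum class CheckResult { Ok, Warning };

    // The rule still fires, but as a warning the user can override with an
    // explicit "add anyway", instead of a hard "can't add relationship" error.
    CheckResult checkNewRelationship(
        const Person& parent, const Person& child,
        const std::function<bool(const Person&, const Person&)>& isAncestorOf) {
        if (isAncestorOf(child, parent))
            return CheckResult::Warning;  // suspicious, but possible in real data
        return CheckResult::Ok;
    }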
Here's the problem with family trees: they are not trees. They are directed acyclic graphs or DAGs. If I understand the principles of the biology of human reproduction correctly, there will not be any cycles.
As far as I know, even Christians accept marriages (and thus children) between cousins, which turns the family tree into a family DAG.
The moral of the story is: choose the right data structures.
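In code, the difference is simply that a node may have any number of parents and children, with nothing forbidding two paths from meeting again. A minimal C++ sketch:

    #include <string>
    #include <vector>

    // A family DAG node: nothing here enforces tree-ness.
    struct PersonNode {
        std::string name;
        std::vector<PersonNode*> parents;   // usually 2, but the structure doesn't care
        std::vector<PersonNode*> children;  // any number; may be shared with a cousin's subtree
    };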
I guess that you have some value that uniquely identifies a person on which you can base your checks.
This is a tricky one. Assuming you want to keep the structure a tree, I suggest this:
Assume this: A has kids with his own daughter.
A is added to the program twice: once as A, in the role of father, and once as B, in the role of, let's call it, boyfriend.
Add an is_same_for_out() function which tells the output-generating part of your program that all links going to B should be redirected to A when the data is presented.
This will make some extra work for the user, but I guess it would be relatively easy to implement and maintain.
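A rough sketch of the mapping part, using the function name from above (everything else is hypothetical):

    #include <unordered_map>

    // Maps a duplicate node's id to the id of the real person it stands for.
    // Empty for ordinary people.
    std::unordered_map<int, int> sameAs;  // B's id -> A's id

    // Tells the output layer which person a link should really point at.
    int is_same_for_out(int personId) {
        auto it = sameAs.find(personId);
        return it != sameAs.end() ? it->second : personId;
    }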
Building from that, you could work on code synchronizing A and B to avoid inconsistencies.
This solution is surely not perfect, but is a first approach.
You should focus on what really adds value to your software. Is the time spent making it work for ONE customer worth the price of the license? Likely not.
I advise you to apologize to this customer, tell him that his situation is out of scope for your software and issue him a refund.
You should have set up the Atreides family (either modern, Dune, or ancient, Oedipus Rex) as a testing case. You don't find bugs by using sanitized data as a test case.
This is one of the reasons why languages like Go do not have assertions. All too often, they are used to handle cases that you probably didn't think about. You should only assert the impossible, not merely the unlikely; doing the latter is what gives assertions a bad reputation. Every time you type assert(, walk away for ten minutes and really think about it.
In your particularly disturbing case, it is both conceivable and appalling that such an assertion would be bogus under rare but possible circumstances. Hence, handle it in your app, if only to say "This software was not designed to handle the scenario that you presented".
Asserting that it is impossible for your great-great-great-grandfather to also be your father is a reasonable thing to do.
If I were working for a testing company hired to test your software, of course I would have presented that scenario. Why? Every juvenile yet intelligent 'user' is going to do exactly the same thing and relish the resulting 'bug report'.
I hate commenting on such a screwed up situation, but the easiest way to not rejigger all of your invariants is to create a phantom vertex in your graph that acts as a proxy back to the incestuous dad.
So, I've done some work on family tree software. I think the problem you're trying to solve is that you need to be able to walk the tree without getting into infinite loops; in other words, the tree needs to be acyclic.
However, it looks like you're asserting that there is only one path between a person and one of their ancestors. That guarantees there are no cycles, but it is too strict. Biologically speaking, descent is a directed acyclic graph (DAG). The case you have is certainly a degenerate one, but that type of thing happens all the time in larger trees.
For example: of the 2^n ancestors you have at generation n, if there were no overlap, you'd have more ancestors in 1000 AD than there were people alive at the time. So there has to be overlap.
However, you also tend to get cycles that are invalid: just bad data. If you're traversing the tree, then cycles must be dealt with. You can do this in each individual algorithm, or on load. I did it on load.
Finding true cycles in the graph can be done in a few ways. The wrong way is to mark every ancestor of a given individual and, when traversing, cut the link if the person you're about to step to is already marked; this severs potentially accurate relationships. The correct way is to start from each individual and mark each ancestor with the path to that individual. If the new path contains the current path as a subpath, then it's a cycle and should be broken. You can store paths as vector<bool> (MFMF, MFFFMF, etc.), which makes the comparison and storage very fast.
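Here is a condensed sketch of that approach; this is my reconstruction of the described algorithm, not the original code:

    #include <algorithm>
    #include <map>
    #include <utility>
    #include <vector>

    struct Person {
        Person* mother = nullptr;
        Person* father = nullptr;
    };

    // True if 'prefix' is an initial subpath of 'path'.
    static bool isPrefix(const std::vector<bool>& prefix, const std::vector<bool>& path) {
        if (prefix.size() > path.size()) return false;
        return std::equal(prefix.begin(), prefix.end(), path.begin());
    }

    // Walk all ancestors of 'start', recording the path (false = mother,
    // true = father) to each one. Reaching an ancestor twice, where one path
    // extends the other, means there is a path from that ancestor back to
    // itself: a true cycle. Two unrelated paths are legitimate DAG overlap.
    void findCycles(Person* start) {
        std::map<Person*, std::vector<bool>> seen;
        std::vector<std::pair<Person*, std::vector<bool>>> stack{{start, {}}};
        while (!stack.empty()) {
            auto [p, path] = stack.back();
            stack.pop_back();
            auto it = seen.find(p);
            if (it != seen.end()) {
                if (isPrefix(it->second, path) || isPrefix(path, it->second)) {
                    // cycle detected: break (or weaken) the offending link here
                }
                continue;  // already expanded from the first visit
            }
            seen[p] = path;
            if (p->mother) { auto m = path; m.push_back(false); stack.push_back({p->mother, m}); }
            if (p->father) { auto f = path; f.push_back(true);  stack.push_back({p->father, f}); }
        }
    }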
There are a few other ways to detect cycles, such as sending out two iterators and seeing if they ever collide with the subset test, but I ended up using the local storage method.
Also note that you don't need to actually sever the link, you can just change it from a normal link to a 'weak' link, which isn't followed by some of your algorithms. You will also want to take care when choosing which link to mark as weak; sometimes you can figure out where the cycle should be broken by looking at birthdate information, but often you can't figure out anything because so much data is missing.
Another mock-serious answer for a silly question:
The real answer is, use an appropriate data structure. Human genealogy cannot fully be expressed using a pure tree with no cycles. You should use some sort of graph. Also, talk to an anthropologist before going any further with this, because there are plenty of other places similar errors could be made trying to model genealogy, even in the most simple case of "Western patriarchal monogamous marriage."
Even if we want to ignore locally taboo relationships as discussed here, there are plenty of perfectly legal and completely unexpected ways to introduce cycles into a family tree.
For example: http://en.wikipedia.org/wiki/Cousin_marriage
Basically, cousin marriage is not only common and expected, it is the reason humans have gone from thousands of small family groups to a worldwide population of 6 billion. It can't work any other way.
There really are very few universals when it comes to genealogy, family and lineage. Almost any strict assumption about norms suggesting who an aunt can be, or who can marry who, or how children are legitimized for the purpose of inheritance, can be upset by some exception somewhere in the world or history.
Potential legal implications aside, it certainly seems that you need to treat a 'node' on a family tree as a predecessor-person rather than assuming that the node is the one and only record of that person.
Have the tree node include a person as well as the successors; then you can have another node deeper down the tree that includes the same person with different successors.
A few answers have shown ways to keep the assertions/invariants, but this seems like a misuse of assertions/invariant. Assertions are to make sure something that should be true is true, and invariants are to make sure something that shouldn't change doesn't change.
What you're asserting here is that incestuous relationships don't exist. Clearly they do exist, so your assertion is invalid. You can work around this assertion, but the real bug is in the assertion itself. The assertion should be removed.
Your family tree should use directed relations. This way you won't have a cycle.
Genealogical data is cyclic and does not fit into an acyclic graph, so if you have assertions against cycles you should remove them.
The way to handle this in a view without creating a custom view is to treat the cyclic parent as a "ghost" parent. In other words, when a person is both a father and a grandfather to the same person, then the grandfather node is shown normally, but the father node is rendered as a "ghost" node that has a simple label like ("see grandfather") and points to the grandfather.
In order to do calculations you may need to improve your logic to handle cyclic graphs so that a node is not visited more than once if there is a cycle.
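For example, a minimal cycle-safe ancestor walk might look like this (illustrative names):

    #include <unordered_set>
    #include <vector>

    struct Person {
        std::vector<Person*> parents;
    };

    // Visit every ancestor exactly once, even if the graph contains a cycle.
    void visitAncestors(Person* p, std::unordered_set<Person*>& visited) {
        for (Person* parent : p->parents) {
            if (visited.insert(parent).second)  // insert fails if already visited
                visitAncestors(parent, visited);
        }
    }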
The most important thing is to avoid creating a problem, so I believe that you should use a direct relation to avoid having a cycle.
As #markmywords said, #include "fritzl.h".
Finally, I have to say: recheck your data structure. Maybe something is going wrong there (perhaps a bidirectionally linked list would solve your problem).
Assertions don't survive reality
Usually, assertions don't survive contact with real-world data. Deciding which data you want to deal with and which are out of scope is part of the software engineering process.
Cyclic family graphs
Regarding family "trees" (in fact they are full-blown graphs, possibly including cycles), there is a nice anecdote:
I married a widow who had a grown daughter. My father, who often visited us, fell in love with my step-daughter and married her. As a result, my father became my son, and my daughter became my mother. Some time later, I gave my wife a son, who was the brother of my father, and my uncle. My father's wife (who is also my daughter and my mother) got a son. As a result, I got a brother and a grandson in the same person. My wife is now my grandmother, because she is my mother's mother. So I am the husband of my wife, and at the same time the step-grandson of my wife. In other words, I'm my own grandpa.
Things get even stranger when you take surrogacy or "fuzzy fatherhood" into account.
How to deal with that
Define cycles as out-of-scope
You could decide that your software should not deal with such rare cases. If such a case occurs, the user should use a different product. This makes dealing with the more common cases much more robust, because you can keep more assertions and a simpler data model.
In this case, add some good import and export features to your software, so the user can easily migrate to a different product when necessary.
Allow manual relations
You could allow the user to add manual relations. These relations are not "first-class citizens", i.e. the software takes them as-is, doesn't check them and doesn't handle them in the main data model.
The user can then handle rare cases by hand. Your data model will still stay quite simple and your assertions will survive.
Be careful with manual relations. There is a temptation to make them completely configurable and hence create a fully configurable data model. This will not work: your software will not scale, you will get strange bugs, and the user interface will eventually become unusable. This anti-pattern is called "soft coding", and The Daily WTF is full of examples of it.
Make your data model more flexible, skip assertions, test invariants
The last resort would be making your data model more flexible. You would have to skip nearly all assertions and base your data model on a full-blown graph. As the above example shows, it is easily possible to be your own grandfather, so you can even have cycles.
In this case, you should test your software extensively. You have had to skip nearly all assertions, so there is a good chance of additional bugs.
Use a test data generator to check unusual test cases. There are QuickCheck-style libraries for Haskell, Erlang, and C; for Java / Scala there are ScalaCheck and Nyaya. One test idea would be to simulate a random population, let it interbreed at random, then have your software first import and then export the result. The expectation would be that every connection in the output is also in the input, and vice versa (a sketch follows below).
A property that stays the same throughout such a test is called an invariant. Here, the invariant is the set of "romantic relations" between the individuals in the simulated population. Try to find as many invariants as possible and test them with randomly generated data. Invariants can be functional, e.g.:
an uncle stays an uncle, even when you add more "romantic relations"
every child has a parent
a population with three generations has at least one grandparent
Or they can be technical:
Your software will not crash on a graph up to 10 billion members (no matter how many interconnections)
Your software scales with O(number-of-nodes) and O(number-of-edges^2)
Your software can save and re-load every family graph up to 10 billion members
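As a toy illustration of the randomized import/export test mentioned above (everything here is a simplified stand-in for your real model and I/O code):

    #include <cassert>
    #include <cstdlib>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy model: person i's parents as indices into the same vector; -1 = unknown.
    using Population = std::vector<std::pair<int,int>>;

    // Random population where parents always precede their children.
    Population randomPopulation(unsigned seed, int size) {
        std::srand(seed);
        Population pop;
        for (int i = 0; i < size; ++i)
            pop.push_back({i > 0 ? std::rand() % i : -1,
                           i > 0 ? std::rand() % i : -1});
        return pop;
    }

    // Stand-ins for your real export and import code.
    std::string exportPop(const Population& p) {
        std::ostringstream out;
        for (auto [m, f] : p) out << m << ' ' << f << ' ';
        return out.str();
    }

    Population importPop(const std::string& s) {
        std::istringstream in(s);
        Population p;
        int m, f;
        while (in >> m >> f) p.push_back({m, f});
        return p;
    }

    int main() {
        // Invariant: every relation survives an export/import round trip.
        for (unsigned seed = 0; seed < 100; ++seed) {
            Population original = randomPopulation(seed, 50);
            assert(importPop(exportPop(original)) == original);
        }
        return 0;
    }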
By running the simulated tests, you will find lots of strange corner cases. Fixing them will take a lot of time. You will also lose a lot of optimizations, and your software will run much slower. You have to decide whether it is worth it and whether this is in the scope of your software.
Instead of removing all assertions, you should still check for things like a person being his/her own parent, or other impossible situations, and present an error. For merely unlikely situations, maybe issue a warning instead, so the user can still catch common input errors, while fully correct data is accepted.
I would store the people in a vector, giving each person a permanent integer ID, and store the parents and children inside the person objects as indices into that vector. This would be pretty fast for moving between generations (but slow for things like name searches). The objects would be in order of creation.
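A small sketch of that layout (my names):

    #include <string>
    #include <vector>

    struct Person {
        std::string name;
        std::vector<int> parents;   // indices into the people vector
        std::vector<int> children;  // indices into the people vector
    };

    // People are appended in creation order; an index is a permanent ID,
    // so generation hops are O(1) lookups, while a name search is a linear scan.
    std::vector<Person> people;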
Duplicate the father (or use symlink/reference).
For example, if you are using hierarchical database:
$ # each person node has two nodes representing its parents
$ mkdir Family
$ mkdir Family/Son
$ mkdir Family/Son/Daughter
$ mkdir Family/Son/Daughter/Father
$ mkdir Family/Son/Daughter/Wife
$ ln -s Daughter/Father Family/Son/Father
$ tree Family
Family
└── Son
    ├── Daughter
    │   ├── Father
    │   └── Wife
    └── Father -> Daughter/Father

4 directories, 1 file
Related
I am doing my own research project, and I am quite struggling regarding the right choice of architectural/design patterns.
In this project, after the "system" starts, I need to do something in the background (tasks, processing, displaying data, and so on) and at the same time be able to interact with the system using, for example, the keyboard, to send commands like "give me the status of this particular object" or "what is the data in this object".
So my question is: what software architectural/design patterns can be applied to this particular project? How should the interaction between classes/objects be organized? How should the objects be created?
Can, for example, "event-driven architecture" or "Microkernel" be applied here? Some references to useful resources will be very much appreciated!
Thank you very much in advance!
Careful with design patterns. If you sprinkle them throughout your code hoping that everything will work great, you'll soon have an unreadable, boilerplate full mess. They are recipes, not solutions.
My advice to you is to pick up a piece of paper and a pencil and start drawing all the entities of your domain, with all their requisites, and see how they relate. If you want to get somewhat serious about it, you can do something like this.
When defining your entities, strive for high cohesion and loose coupling.
High cohesion means that you should keep similar functionalities together. In a very simple example, if you have a class that reads stuff from a file and processes it, the class has low cohesion, since reading and processing are two very distinct functionalities. In this case, you would want a class for each functionality.
As for loose coupling, it means that your entities should be independent of each other. Using the example above, suppose that you are now the proud owner of two highly cohesive classes: one that reads stuff from a file (Reader), and one that processes that stuff (Processor). Now, suppose that the Processor class has an instance of the Reader class and calls it in order to get its input. In this case, we can say that both classes are tightly coupled, since Processor won't work without Reader. In the OOP world, the solution for this is typically the use of interfaces. You can find a neat example here.
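A minimal C++ sketch of that interface-based decoupling (hypothetical names):

    #include <string>

    // The interface Processor depends on, instead of the concrete Reader.
    struct Source {
        virtual ~Source() = default;
        virtual std::string read() = 0;
    };

    struct FileReader : Source {
        std::string read() override { return "stuff read from a file"; }
    };

    // Processor only knows about Source, so any implementation can be
    // swapped in (a file, a network stream, a test stub...).
    struct Processor {
        explicit Processor(Source& s) : source(s) {}
        void process() {
            std::string data = source.read();
            // ... process data without knowing where it came from ...
            (void)data;
        }
        Source& source;
    };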
After defining an initial model of your domain and gathering as much knowledge about it as you can, you can start to think about the implementation's architecture. This is where you can start thinking about the architectural patterns. Event-driven architecture, clean architecture, MVP, MVVM... It will all depend on your domain. It is your job to know which pattern will fit best. Spoiler alert: this can be extremely hard to do correctly, even for experienced engineers, so don't be afraid to fail.
Finally, leave the design patterns for the implementation stage. Their use completely depends on your implementation problems and decisions. Also, DON'T FORCE THEM. Ideally, you will solve a problem and, IF APPLICABLE, you'll see a pattern emerging. Trust me, the last thing you want is to have a case of design patternitis. Anyway, if you need literature on patterns, I totally recommend this book. It's great no matter your level as an engineer.
Further reading:
SOLID principles
Onion Architecture
Clean architecture
Good luck!
You have a background task, and indeed it can be driven by a message pump/event queue. Your foreground task would then send requests to this background thread and asynchronously wait for the result.
Have a look at the book "Patterns for Parallel Programming".
You would do much better to check a book on design patterns. I really like this one.
For example, if you need to get some data from a particular object, the Observer pattern may work for you: as soon as the object has the data, you (or another object) are notified about it and can work with it, possibly using another pattern (Strategy might work; it really depends on what you have to do).
If you have to do several things at the same time, also check the Singleton pattern (well, check the most important ones!).
How do you test the result of a program that is basically a black box? For example, a year ago I had to write a B-tree as homework, and I really struggled to test its correctness. What strategies do you use in such scenarios? Visualization? Robust input-->result sets of test data? What do you do when such data is hard to obtain because the only way to get it is your own properly working program?
EDIT: I think my question was misunderstood. There was no problem with understanding how a B-tree works. That is trivial. But writing robust tests to validate its proper functionality is not so trivial. I think this school problem is similar to many practical REAL world scenarios and test cases. And sometimes understanding the domain is quite different from delivering a working and correct program...
EDIT2: And yes, with a B-tree it is possible to validate proper behavior with pen and paper. But this is really tedious and not fun :) It does not work well for problems that require huge amounts of data for their validation...
I'm not sure these answers really capture the problem at hand. A B-tree's input and output aren't any different from those of any other dictionary, but the algorithm performs better, if it's implemented correctly. It's only really got two functions to test (add and find), so theoretically, "black-box" testing of this single component should be fine. Designing for testability isn't the issue, since no matter how you do it, the whole algorithm will be one component.
So the question is: when you have to implement subtle algorithms, the kinds with complicated output that you can't always understand in your head so well, how do you test them? I think there are three different strategies you can use:
Black-box test basic functionality. For the B-tree case, this includes the things cwash suggested, and also things like making sure that when you add an item, you can then find it, etc.
Test certain invariants that your algorithm should maintain (the B-tree should be balanced, values within nodes should be sorted, etc.); see the sketch after this list.
A few small "pencil-and-paper" tests may be necessary: work the algorithm out by hand and check that it matches what your code does. But the big-data tests can all be of type 2. These can also be brittle, so unless you need to be really sure about your algorithm, you may want to avoid them.
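To illustrate the second strategy, here is a minimal invariant checker for a simplified B-tree node (a sketch; a real B-tree node would carry more bookkeeping):

    #include <cassert>
    #include <cstddef>
    #include <vector>

    struct Node {
        std::vector<int> keys;
        std::vector<Node*> children;  // empty for a leaf
    };

    // Checks that keys are sorted in every node and that all leaves sit at
    // the same depth (i.e. the tree is balanced). Returns the subtree height.
    int checkInvariants(const Node* n) {
        for (std::size_t i = 1; i < n->keys.size(); ++i)
            assert(n->keys[i - 1] <= n->keys[i]);            // sorted within node
        if (n->children.empty())
            return 0;                                        // leaf
        assert(n->children.size() == n->keys.size() + 1);    // B-tree shape
        int depth = checkInvariants(n->children[0]);
        for (const Node* c : n->children)
            assert(checkInvariants(c) == depth);             // balanced
        return depth + 1;
    }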
If you do not grasp the problem at hand, how can you develop a solution to it? My suggestion would be to understand the domain enough to be able to work out the problem on paper and ensure that your program matches.
Consult with an expert on the subject.
I know if I have a convoluted procedure I'm trying to fix, I have no idea what the output should be after my changes, so I need to consult a fellow developer with more knowledge of the business need, and they are able to verify what I've done is correct.
I would focus on constructing test cases that exercise the functionality of your B-tree algorithm. I haven't looked at it for years, but I'm fairly sure you'll be able to find a documented sequence of steps to insert a set of values in a specific order, then validate that the leaf nodes are as they should be. If you construct your testing along these lines, you should be able to prove your implementation is correct.
The key is to know there is a balance between testing something to death and doing tests that adequately cover what should be covered. Edge cases, e.g. null inputs or checking that inputs are numeric by testing an alphabet character or a punctuation character, are likely most of the tests you'd need. To complement this, there may be one or two common cases to handle, to show the program can handle a non-edge case as well. Covering all valid input in most programs is overkill and would result in an overwhelmingly large number of tests.
I think the answer to the question you're asking boils down to designing for testability. Often you get a testable design for free when you test-drive the development of the solution. But let's face it, when you're implementing a highly mathematical algorithm, this just doesn't fall out.
To make sure you have a testable design, you need to understand what a seam is. Then you need to know a few rules of thumb, such as avoiding statics, using polymorphism, and properly decomposing problems and separating concerns.
Watch "The Clean Code Talks -- Unit Testing" by Misko Hevery, I think it will help you wrap your head around it.
Try looking at it from a requirements point of view, rather than an implementation point of view. Before you write code, you must understand exactly what you want it to do.
Testing and requirements should be a matching pair. If you're having trouble defining tests, maybe it's because the requirements are not well-defined. That in turn implies that you may have bugs that aren't so much implementation bugs, but "lack of clear requirements" bugs. The code writer in that case would be working to a mental list of requirements that he/she thinks is requirements, but can't be sure, and they're not written down for independent understanding and verification.
I've struggled with software where the requirements weren't clear, because the customer couldn't even tell us what they wanted. But when we delivered to them, they sure could tell us then what they didn't like about it! A big part of software engineering is getting the requirements right before the coding begins. This is true on the high-level (overall product, with requirement input from customer) and also the smaller level (modules, individual functions, where requirements are internally defined by software team or individuals). It is still true to some degree I think for iterative development, although the high-level requirements are more fluid.
@Bystrik Jurina,
I often get involved in projects which involve conversions between disparate data formats. Most answers have focused on testing a B-tree or similar algorithm, but it seems that you're looking for a more general answer.
Most of my work is based on the command line. It may sound like a contradiction, but one of the first tools I use is visualization. I'll write some methods to write out my data structures in a format that's easy to consume. This can (and usually does) include something that's visually clear. But often it also means something that I could easily parse with a smaller test program, or even import into Excel.
I'll start by focusing on the basic outline and write a program that does the bare minimum of what I need to accomplish. If it's a multi-step process, this might mean implementing one step at a time and validating the results of each step before moving on. Or it might mean writing something that works only in specific cases and then expanding the set of cases where it's expected to work. At first you can validate that the code works for a limited set of cases, such as known input data. As the project moves forward, you can start logging warnings for cases you might not have tested, or for unexpected types of input data. This has drawbacks, but it is a nice approach when you're dealing with a known set of input data.
Validation techniques can include formal test cases, or informal programs that work to challenge your assumptions. It could mean writing a basic driver program to exercise the "core" routines. A good example would be to add a record to a database, then read it back and compare the original object against the one loaded from the database.
If you have trouble wrapping your head around the way a program functions, think about what it needs to accomplish. It might be easier to write code that tests how different inputs produce different outputs. Producing visualizations is a good help, because the act of deciding how to display the data can make you think about different conditions and focus in on the most critical parts of your data structures.
Often I've found that building a visualization brings me to admit that the way the data is being stored just isn't very clear. For a B-tree, the representation isn't very flexible. But for other cases, you may be using parallel arrays when a nested tree of objects would be more natural.
I generally try to use unit tests for any code that has easily defined correct behavior given some reasonably small, well-defined set of inputs. This works quite well for catching bugs, and I do it all the time in my personal library of generic functions.
However, a lot of the code I write is data mining code that basically looks for significant patterns in large datasets. Correct behavior in this case is often not well defined and depends on a lot of different inputs in ways that are not easy for a human to predict (i.e. the math can't reasonably be done by hand, which is why I'm using a computer to solve the problem in the first place). These inputs can be very complex, to the point where coming up with a reasonable test case is near impossible. Identifying the edge cases that are worth testing is extremely difficult. Sometimes the algorithm isn't even deterministic.
Usually, I do the best I can by using asserts for sanity checks and creating a small toy test case with a known pattern and informally seeing if the answer at least "looks reasonable", without it necessarily being objectively correct. Is there any better way to test these kinds of cases?
I think you just need to write unit tests based on small sets of data that will make sure your code is doing exactly what you want it to do. Whether this gives you a reasonable data-mining algorithm is a separate issue, and I don't think it is possible to solve it with unit tests. There are two "levels" of correctness of your code:
Your code is correctly implementing the given data mining algorithm (this thing you should unit-test)
The data mining algorithm you implement is "correct", i.e. it solves the business problem. This is quite an open question; it probably depends both on some parameters of your algorithm and on the actual data (different algorithms work for different types of data).
When facing cases like this I tend to build one or more stub data sets that reflect the proper underlying complexities of the real-life data. I often do this together with the customer, to make sure I capture the essence of the complexities.
Then I can just codify these into one or more datasets that can be used as basis for making very specific unit tests (sometimes they're more like integration tests with stub data, but I don't think that's an important distinction). So while your algorithm may have "fuzzy" results for a "generic" dataset, these algorithms almost always have a single correct answer for a specific dataset.
Well, there are a few answers.
First of all, as you mentioned, take a small case study, and do the math by hand. Since you wrote the algorithm, you know what it's supposed to do, so you can do it in a limited case.
The other one is to break down every component of your program into testable parts.
If A calls B, which calls C, which calls D, and you know that A, B, C, and D each give the right answer, then once you test A->B, B->C, and C->D, you can be reasonably sure that A->D gives the correct response.
Also, if there are other programs out there that do what you are looking to do, try to acquire their datasets, or find an open-source project whose test data you could use, and see if your application gives similar results.
Another way to test datamining code is by taking a test set, and then introducing a pattern of the type you're looking for, and then test again, to see if it will separate out the new pattern from the old ones.
And, tried and true: walk through your own code by hand and see if it is doing what you meant it to do.
Really, the challenge here is this: because your application is meant to do a fuzzy, non-deterministic kind of task in a smart way, the very goal you hope to achieve is that the application becomes better than human beings at finding these patterns. That's great, powerful, and cool ... but if you pull it off, then it becomes very hard for any human beings to say, "In this case, the answer should be X."
In fact, ideally the computer would say, "Not really. I see why you think that, but consider these 4.2 terabytes of information over here. Have you read them yet? Based on those, I would argue that the answer should be Z."
And if you really succeeded in your original goal, the end user might sometimes say, "Zowie, you're right. That is a better answer. You found a pattern that is going to make us money! (or save us money, or whatever)."
If such a thing could never happen, then why are you asking the computer to detect these kinds of patterns in the first place?
So, the best thing I can think of is to let real life help you build up a list of test scenarios. If there ever was a pattern discovered in the past that did turn out to be valuable, then make a "unit test" that sees if your system discovers it when given similar data. I say "unit test" in quotes because it may be more like an integration test, but you may still choose to use NUnit or VS.Net or RSpec or whatever unit test tools you're using.
For some of these tests, you might somehow try to "mock" the 4.2 terabytes of data (you won't really mock the data, but at some higher level you'd mock some of the conclusions reached from that data). For others, maybe you have a "test database" with some data in it, from which you expect a set of patterns to be detected.
Also, if you can do it, it would be great if the system could "describe its reasoning" behind the patterns it detects. This would let the business user deliberate over the question of whether the application was right or not.
This is tricky. This sounds similar to writing tests around our text search engine. If you keep struggling, you'll figure something out:
Start with a small, simplified, but reasonably representative data sample, and test basic behavior against it.
Rather than asserting that the output is exactly some answer, sometimes it's better to figure out what is important about it. For example, for our search engine, I didn't care so much about the exact order the documents were listed, as long as the three key ones were on the first page of results.
As you make a small, incremental change, figure out what the essence of it is and write a test for that. Even though the overall calculations take many inputs, individual changes to the codebase should be isolatable. For example, we found certain documents weren't being surfaced because of the presence of hyphens in some of the key words. We created tests asserting that this behaved as we expected.
Look at tools like FitNesse, which allow you to throw a large number of datasets at a piece of code and assert things about the results. This may be easier to understand than more traditional unit tests.
I've gone back to the product owner, saying "I can't understand how this will work. How will we know if it's right?" Maybe s/he can articulate the essence of the vaguely defined problem. This has worked really well for me many times, and I've talked people out of features because they couldn't be explained.
Be creative!
Ultimately, you have to decide what your program should be doing, and then test for that.
Having read up on quite a few articles on Artificial Life (A subject I find very interesting) along with several questions right here on SO, I've begun to toy with the idea of designing a (Very, very, very) simple simulator. No graphics required, even. If I've overlooked a question, please feel free to point it out to me.
Like I said, this will hardly be a Sims level simulation. I believe it will barely reach "acceptable freeware" level, it is simply a learning exercise and something to keep my skills up during a break. The basic premise is that a generic person is created. No name, height or anything like that (Like I said, simple), the only real thing it will receive is a list of "associations" and generic "use", "pick up" and "look" abilities.
My first question is in regards to the associations. What does SO recommend as an efficient way to handle such things? I was thinking of a multimap, with the relatively easy setup of the key being what it wants (food, eat, rest, et cetera) and the other part (the mapped value; sorry, my mind has lapsed) being what it associates with that need.
For example, say we have a fridge. The fridge contains food (Just a generic base object). Initially the person doesn't associate fridge with food, but it does associate food with hunger. So when its hunger grows it begins to arbitrarily look for food. If no food is within reach it "uses" objects to find food. Since it has no known associations with food it uses things willy-nilly (Probably looking for the nearest object and expanding out). Once it uses/opens the fridge it sees food, making the connection (Read: inserting the pair "food, fridge") that the fridge contains food.
Now, I realise this will be far more complex than it appears, and I'm prepared to hammer it out. The question is, would a multimap be suitable for a (Possibly) exponentially expanding list of associations? If not, what would be?
The second question I have is probably far easier. Simply put, would a generic object/item interface be suitable for most any item? In other words, would a generic "use" interface work for what I intend? I don't think I'm explaining this well.
Anyway, any comments are appreciated.
If you were doing this as a hard-core development project, I'd suggest using the equivalent of Java reflection (substitute the language of your choice there). If you want to do a toy project as a starter effort, I'd suggest at least rolling your own simple version of reflection, per the following rationale.
Each artifact in your environment offers certain capabilities. A simple model of that fact is to ask what "verbs" are applicable to each object your virtual character encounters (including possible dependence on the current state of that object). For instance, your character can "open" a refrigerator, a box of cereal, or a book, provided that each of them is in a "closed" state. Once a book is opened, your character can read it or close it. Once a refrigerator is opened, your character can "look-in" it to get a list of visible contents, can remove an object from it, put an object in it, etc.
The point is that a typical situation might involve your character looking around to see what is visible, querying an object to determine its current state or what can be done with it (i.e. "what-state" and "what-can-i-do" are general verbs applicable to all objects), and then use knowledge about its current state, the state of the object, and the verb list for that object to try doing various things.
By implementing a set of positive and negative feedback, over time your character can "learn" under what circumstances it should or should not engage in different behaviors. (You could obviously make this simulation interactive by having it ask the user to participate in providing feedback.)
The above is just a sketch, but perhaps it can give you some interesting ideas to play with. Have fun! ;-)
To the first question:
My understanding is that you have a one-to-possibly-many relationship, so yes, a multimap seems appropriate to me.
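For instance, a toy sketch in C++:

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        // need -> thing associated with satisfying it: one key, possibly many values
        std::multimap<std::string, std::string> associations;
        associations.insert({"hunger", "food"});
        associations.insert({"food", "fridge"});   // learned after opening the fridge
        associations.insert({"food", "cupboard"});

        auto [begin, end] = associations.equal_range("food");
        for (auto it = begin; it != end; ++it)
            std::cout << "food <- " << it->second << '\n';
    }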
To the second question:
Yes, I believe a generic interface for objects is appropriate. Perhaps use something similar to REST to manipulate object state.
I heard a podcast a while back with the developer of The Noble Ape Simulation, which may be interesting to you - conceptwise and perhaps codewise, as you can access the source code, as well as download binaries.
The podcast was FLOSS Weekly 31 with Randal Schwartz and Leo Laporte.
Life with Lisp (SBCL) :)
I have inherited a monster.
It is masquerading as a .NET 1.1 application that processes text files conforming to Healthcare Claim Payment (ANSI 835) standards, but it's a monster. The information being processed relates to healthcare claims, EOBs, and reimbursements. These files consist of records that have an identifier in the first few positions and data fields formatted according to the specs for that type of record. Some record ids are Control Segment ids, which delimit groups of records relating to a particular type of transaction.
To process a file, my little monster reads the first record, determines the kind of transaction that is about to take place, then begins to process other records based on what kind of transaction it is currently processing. To do this, it uses a nested if. Since there are a number of record types, there are a number decisions that need to be made. Each decision involves some processing and 2-3 other decisions that need to be made based on previous decisions. That means the nested if has a lot of nests. That's where my problem lies.
This one nested if is 715 lines long. Yes, that's right. Seven-Hundred-And-Fif-Teen lines. I'm no code analysis expert, so I downloaded a couple of freeware analysis tools and came up with a McCabe cyclomatic complexity rating of 49. They tell me that's a pretty high number. High as in pollen count in the Atlanta area, where 100 is the standard for high and the news says "Today's pollen count is 1,523". This is one of the finest examples of the Arrow Anti-Pattern I have ever been privileged to see. At its deepest, the indentation goes 15 tabs.
My question is, what methods would you suggest to refactor or restructure such a thing?
I have spent some time searching for ideas, but nothing has given me a good foothold. For example, substituting a guard condition for a level is one method. I have only one of those. One nest down, fourteen to go.
Perhaps there is a design pattern that could be helpful. Would Chain of Command be a way to approach this? Keep in mind that it must stay in .NET 1.1.
Thanks for any and all ideas.
I just dealt with some legacy code at work this week that was similar to (although not as dire as) what you are describing.
There is no one thing that will get you out of this. The state machine might be the final form your code takes, but that's not going to help you get there, nor should you decide on such a solution before untangling the mess you already have.
The first step I would take is to write a test for the existing code. This test isn't to show that the code is correct, but to make sure you have not broken something when you start refactoring. Get a big wad of data to process, feed it to the monster, and capture the output. That's your litmus test. If you can do this with a code coverage tool, you will see what your test does not cover. If you can, construct some artificial records that will also exercise this code, and repeat. Once you feel you have done what you can with this task, the output data becomes the expected result for your test.
Refactoring should not change the behavior of the code. Remember that. This is why you have known input and known output data sets to validate you are not going to break things. This is your safety net.
Now Refactor!
A couple things I did that i found useful:
Invert if statements
A huge problem I had was just reading the code when I couldn't find the corresponding else statement. I noticed that a lot of the blocks looked like this:
if (someCondition)
{
    100+ lines of code
    {
        ...
    }
}
else
{
    simple statement here
}
By inverting the if, I could see the simple case first and then move on to the more complex block knowing what the other branch already did. Not a huge change, but it helped my understanding.
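Inverted, the same blocks read like this, with the simple case handled first (same placeholder pseudocode as above):

if (!someCondition)
{
    simple statement here
}
else
{
    100+ lines of code
    {
        ...
    }
}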
Extract Method
I used this a lot. Take some complex multi-line block, grok it, and shove it aside into its own method. This allowed me to see more easily where there was code duplication.
Now, hopefully, you haven't broken your code (the test still passes, right?), and you have more readable and better-understood procedural code. Look, it's already improved! But that test you wrote earlier isn't really good enough... it only tells you that you are duplicating the functionality (bugs and all) of the original code, and only along the lines you had coverage on; I'm sure you will find blocks of code that you can't figure out how to hit, or that can never be hit at all (I've seen both in my work).
Now, the big changes, where all the big-name patterns come into play, happen when you start looking at how you can refactor this in a proper OO fashion. There is more than one way to skin this cat, and it will involve multiple patterns. Not knowing the details of the format of the files you're parsing, I can only toss around some helpful suggestions that may or may not be the best solutions.
Refactoring to Patterns is a great book to assist in explaining patterns that are helpful in these situations.
You're trying to eat an elephant, and there's no other way to do it but one bite at a time. Good luck.
A state machine seems like the logical place to start; use WF if you can swing it (it sounds like you can't).
You can still implement one without WF, you just have to do it yourself. However, thinking of it as a state machine from the start will probably give you a better implementation than creating a procedural monster that checks internal state on every action.
Diagram out your states and what causes each transition. The actual code to process a record should be factored out and called when the state executes (if that particular state requires it).
So State1's execute calls your "read a record", then based on that record transitions to another state.
The next state may read multiple records and call record processing instructions, then transition back to State1.
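A skeletal version of that loop, sketched in C++ for brevity (the states and handlers here are hypothetical; the same shape carries over to .NET 1.1):

    #include <string>

    enum class State { ReadHeader, ProcessClaim, ProcessPayment, Done };

    // Hypothetical stand-ins for the real record source and per-state handlers;
    // each handler processes a record and returns the next state.
    bool nextRecord(std::string& record);
    State handleHeader(const std::string& record);
    State handleClaim(const std::string& record);
    State handlePayment(const std::string& record);

    void processFile() {
        State state = State::ReadHeader;
        std::string record;
        while (state != State::Done && nextRecord(record)) {
            switch (state) {  // transition depends on current state + record
                case State::ReadHeader:     state = handleHeader(record);  break;
                case State::ProcessClaim:   state = handleClaim(record);   break;
                case State::ProcessPayment: state = handlePayment(record); break;
                case State::Done:           break;
            }
        }
    }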
One thing I do in these cases is to use the 'Composed Method' pattern. See Jeremy Miller's Blog Post on this subject. The basic idea is to use the refactoring tools in your IDE to extract small meaningful methods. Once you've done that, you may be able to further refactor and extract meaningful classes.
I would start with uninhibited use of Extract Method. If you don't have it in your current Visual Studio IDE, you can either get a 3rd-party addin, or load your project in a newer VS. (It'll try to upgrade your project, but you will carefully ignore those changes instead of checking them in.)
You said that you have code indented 15 levels. Start about halfway out and Extract Method. If you can come up with a good name, use it, but if you can't, extract anyway. Split in half again. You're not going for the ideal structure here; you're trying to break the code into pieces that will fit in your brain. My brain is not very big, so I'd keep breaking and breaking until it doesn't hurt any more.
As you go, look for any new long methods that seem different from the rest; make these into new classes. Just use a simple class that has only one method for now. Heck, making the method static is fine. Not because you think they're good classes, but because you are so desperate for some organization.
Check in often as you go, so you can checkpoint your work, understand the history later, be ready to do some "real work" without needing to merge, and save your teammates the hassle of hard merging.
Eventually you'll need to go back and make sure the method names are good, that the set of methods you've created make sense, clean up the new classes, etc.
If you have a highly reliable Extract Method tool, you can get away without good automated tests. (I'd trust VS in this, for example.) Otherwise, make sure you're not breaking things, or you'll end up worse than you started: with a program that doesn't work at all.
A pairing partner would be helpful here.
Judging by the description, a state machine might be the best way to deal with it. Have an enum variable to store the current state, and implement the processing as a loop over the records, with a switch or if statements to select the action to take based on the current state and the input data. You can also easily dispatch the work to separate functions based on the state using function pointers, too, if it's getting too bulky.
There was a pretty good blog post about it at Coding Horror. I've only come across this anti-pattern once, and I pretty much just followed his steps.
Sometimes I combine the state pattern with a stack.
It works well for hierarchical structures; a parent element knows what state to push onto the stack to handle a child element, but a child doesn't have to know anything about its parent. In other words, the child doesn't know what the next state is, it simply signals that it is "complete" and gets popped off the stack. This helps to decouple the states from each other by keeping dependencies uni-directional.
It works great for processing XML with a SAX parser (the content handler just pushes and pops states to change its behavior as elements are entered and exited). EDI should lend itself to this approach too.
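A rough sketch of that combination (illustrative, not tied to any particular parser API):

    #include <stack>

    // Each state knows how to handle the element it was pushed for.
    struct Handler {
        virtual ~Handler() = default;
        virtual void handle(/* element data */) = 0;
    };

    std::stack<Handler*> states;

    // A parent pushes the state for a child element; the child knows
    // nothing about its parent.
    void enterChild(Handler* childState) { states.push(childState); }

    // The child signals it is "complete" and is simply popped; control
    // returns to whatever state lies beneath, keeping dependencies
    // uni-directional.
    void completeCurrent() { states.pop(); }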