Generating a Provable List of Sets of Scenarios - state

I'm asking this with full knowledge that this idea is probably well covered in a subject unfamiliar to me. Suppose you're writing a small piece of code that takes an input of an arbitrary number of variables. Those variables can have several states, namely:
Correct Data
Incorrect Data (outside range, improper formatting, whatever)
Unknown (Null)
So if we have 3 input variables and 3 states per variable, we end up with 27 possible scenarios. Suppose I have to do some logic based on the state of certain variables, or on a combination of states (AND, NAND, OR, etc.). Can I easily structure a program in such a way that I provably cover all scenarios without an absolute mess of if/else style logic? The first thing that came to mind was state machines, but after looking at them for a bit I'm not entirely convinced it's the same thing.

There will be if-style logic, but you can use Karnaugh maps to make it much cleaner and be sure that you've covered every possibility. You make a grid showing every possible combination of states, then mark each cell according to how you want to react to it. The goal is to group states: once you can see that a group of cells is logically "close together," you can simplify your control logic. A quick search for Karnaugh maps will bring up explanations that are much easier to follow thanks to pictures, but the idea is to use the grid to see which variables are irrelevant to a group of states, and optimize them out of the logic.
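The grid idea can be sketched in code as a lookup table built by looping over every combination. This is only an illustration for three inputs: the Action values and the policy in decide() are assumptions made up for the example, not something from the question. The point is that the nested loops visit all 27 cells, so nothing can be left uncovered, while decide() holds the grouped (simplified) logic.

```cpp
#include <array>
#include <cstddef>

// The three states come from the question; everything else is hypothetical.
enum class State { Correct = 0, Incorrect = 1, Unknown = 2 };
enum class Action { Process = 0, Reject = 1, AskAgain = 2 };

constexpr std::size_t kStates = 3;
constexpr std::size_t kScenarios = kStates * kStates * kStates;  // 27 cells

// Map a combination (a, b, c) to a unique index in [0, 27).
constexpr std::size_t scenarioIndex(State a, State b, State c) {
    return (static_cast<std::size_t>(a) * kStates + static_cast<std::size_t>(b)) * kStates
           + static_cast<std::size_t>(c);
}

// Assumed policy, written as grouped rules rather than 27 separate branches:
// any Incorrect input rejects, otherwise any Unknown asks again.
Action decide(State a, State b, State c) {
    if (a == State::Incorrect || b == State::Incorrect || c == State::Incorrect)
        return Action::Reject;
    if (a == State::Unknown || b == State::Unknown || c == State::Unknown)
        return Action::AskAgain;
    return Action::Process;
}

// Fill the whole grid; the loops enumerate every combination exactly once.
std::array<Action, kScenarios> buildTable() {
    std::array<Action, kScenarios> table{};
    for (std::size_t a = 0; a < kStates; ++a)
        for (std::size_t b = 0; b < kStates; ++b)
            for (std::size_t c = 0; c < kStates; ++c) {
                const State sa = static_cast<State>(a);
                const State sb = static_cast<State>(b);
                const State sc = static_cast<State>(c);
                table[scenarioIndex(sa, sb, sc)] = decide(sa, sb, sc);
            }
    return table;
}

// Count how many cells of the grid map to a given action.
std::size_t countAction(const std::array<Action, kScenarios>& table, Action wanted) {
    std::size_t n = 0;
    for (Action a : table)
        if (a == wanted) ++n;
    return n;
}
```

Counting the cells per action is a cheap sanity check: the counts must add up to 27, and a surprising count usually points at a group that was drawn wrongly on the grid.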


Data structure for optimization

I am thinking about a method to handle the data more efficiently. Let me explain it:
Currently, there is a class called Rules with a lot of member functions, like Rules::isForwardEligible(), Rules::isCurrentNumberEligible(), and so on. These functions check specific situations (when other processes call them), and all of them return a bool.
The bodies of these functions are ifs that query the DB to compare data and finally return true or false.
So the whole thing looks like if(Rules::isCurrentNumberEligible()) ---> check the content of Rules::isCurrentNumberEligible() ---> if(xxxx), where xxxx is yet another function that queries the DB. I don't think this approach is good, and I want to improve it.
What I am imagining is to write less code and lean more on queries for the information.
So in the first step, if(Rules::isCurrentNumberEligible()), I could query different tables, so that chains like if(xxx){if(xx){if(xx)....}} become shorter. One solution is to build a class that acts as a coordinator and is asked each time for the different queries. Is that suitable?
I am not sure it is a good way to control this, or maybe there are better solutions. Please help me, thanks!
The classical algorithm for rule-based systems is the RETE algorithm. It strives to minimize the number of rules to be evaluated. The trick is that a re-evaluation of a rule does not make sense unless at least one related fact has changed.
In general, the rules that promise the maximum information gain should be queried first. This helps to pin down the respective case in as few questions as possible.
A physician in differential diagnosis would always order his/her questions from general to specific. In information-theory terms, this means asking first the question whose answer carries the most entropy, i.e. the highest expected information gain.
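The "don't re-evaluate unless a related fact changed" idea can be sketched on its own, without a full Rete network. The following is a minimal illustration under stated assumptions: Facts and CachedRule are hypothetical names, each rule depends on exactly one fact, and a version counter stands in for whatever change notification the real data source offers.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical fact store: every write bumps the fact's version number.
class Facts {
public:
    void set(const std::string& key, int value) {
        Fact& f = facts_[key];
        f.value = value;
        ++f.version;
    }
    int get(const std::string& key) const {
        auto it = facts_.find(key);
        return it == facts_.end() ? 0 : it->second.value;
    }
    unsigned version(const std::string& key) const {
        auto it = facts_.find(key);
        return it == facts_.end() ? 0u : it->second.version;
    }
private:
    struct Fact { int value = 0; unsigned version = 0; };
    std::unordered_map<std::string, Fact> facts_;
};

// A rule that remembers its last result and the fact version it was based on.
class CachedRule {
public:
    CachedRule(std::string dependsOn, std::function<bool(int)> predicate)
        : dependsOn_(std::move(dependsOn)), predicate_(std::move(predicate)) {}

    bool evaluate(const Facts& facts) {
        const unsigned current = facts.version(dependsOn_);
        if (!valid_ || current != seenVersion_) {
            // The related fact changed (or this is the first call): re-run the check.
            result_ = predicate_(facts.get(dependsOn_));
            seenVersion_ = current;
            valid_ = true;
            ++evaluations_;
        }
        return result_;
    }
    int evaluations() const { return evaluations_; }

private:
    std::string dependsOn_;
    std::function<bool(int)> predicate_;
    bool valid_ = false;
    bool result_ = false;
    unsigned seenVersion_ = 0;
    int evaluations_ = 0;
};
```

In the question's setting, the predicate would wrap the expensive DB comparison, so repeated calls to something like isCurrentNumberEligible() would only hit the database after the underlying data changed.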

Hardcoding Parameters vs. loading from a file

I am working on a motion control system, and will have at least 5 motors, each with parameters such as "gearbox ratio", "ticks per rev", "Kp", "Ki", "Kd", etc. that will be referenced upon construction of instances of the motors.
My question to StackOverflow is how should I organize these numbers? I know this is likely a preferential thing, but being new to coding I figure I could get some good opinions from you.
The three approaches I immediately see are as follows:
Write in the call to the constructor, either via variables or numbers-- PROS: limited coding, could be implemented in a way that it's easy to change, but possibly harder than #define's
Use #define's to accomplish similar to above -- PROS: least coding, easy to change (assuming you want to look at the source code)
Load a file (possibly named "motorparameters.txt"), read the parameters into an array, and populate the motors from that array. If I really wanted to, I could add a GUI for changing this file rather than editing it manually. -- PROS: easiest to change without diving into source code.
These parameters could change over time, and while there are other coders at the company, I would like to leave it in a way that's easy to configure. Do any of you see a particular benefit of #define vs. variables? I have a "constants.h" file already that I could easily add the #defines to, or I could add variables near the call to the constructor.
There's a principle known as YAGNI (You Ain't Gonna Need It) which says do the simplest thing first, then change it when (if) your requirements expand.
Sounds to me like the thing to do is:
Write a flexible motor class, that can handle any values (within reason), even though there are only 5 different sets of values you currently care about.
Define a component that returns the "right" values for the 5 motors in your system (or that constructs the 5 motors for your system using the "right values")
Initially implement that component to use some hard-coded values out of a header file
Retain the option to replace that component in future with an implementation of the same API, but that reads values out of a resource file, text file, XML file, GUI interaction with the user, off the internet, by making queries to the hardware to find out what motors it thinks it has, whatever.
I say this on the basis that you minimize expected effort by putting in a point of customizability where you suspect you'll want one (to prevent a lot of work when you change it later), but implement using the simplest thing that satisfies your current certain requirements.
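The steps above might look roughly like this. It is a sketch, not a recommendation of exact names: MotorParams, MotorParamsSource, HardCodedParams and Motor are hypothetical, and every number in the table is a placeholder rather than a real motor value.

```cpp
#include <array>
#include <cstddef>

// Parameter set named after the fields mentioned in the question.
struct MotorParams {
    double gearboxRatio;
    int ticksPerRev;
    double kp;
    double ki;
    double kd;
};

// The point of customizability: anything that can hand out parameters.
class MotorParamsSource {
public:
    virtual ~MotorParamsSource() = default;
    virtual MotorParams paramsFor(std::size_t motorIndex) const = 0;
};

constexpr std::size_t kMotorCount = 5;

// First implementation: hard-coded values (placeholders here).
// A file-backed or GUI-backed source could replace it later behind the same API.
class HardCodedParams : public MotorParamsSource {
public:
    MotorParams paramsFor(std::size_t motorIndex) const override {
        static const std::array<MotorParams, kMotorCount> table = {{
            {10.0, 1024, 1.0, 0.10, 0.01},
            {10.0, 1024, 1.0, 0.10, 0.01},
            {25.0, 2048, 0.8, 0.05, 0.02},
            {25.0, 2048, 0.8, 0.05, 0.02},
            {50.0, 4096, 0.5, 0.02, 0.00},
        }};
        return table.at(motorIndex);  // throws std::out_of_range for a bad index
    }
};

// Flexible motor class: it accepts any parameter set, not just the five above.
class Motor {
public:
    explicit Motor(const MotorParams& params) : params_(params) {}
    const MotorParams& params() const { return params_; }
private:
    MotorParams params_;
};
```

Constructing the motors then reads as Motor m(source.paramsFor(i)), and the calling code does not care whether source is the hard-coded table or something that parses a file.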
Some people might say that it's not actually worth doing the typing (a) to define the component, better just to construct 5 motors in main() (b) to use constants from a header file, better just to type numeric literals in main(). The (b) people are widely despised as peddlers of "magic constants" (which doesn't mean they're necessarily wrong about relative total programming time by implementer and future maintainers, they just probably are) and the (a) people divide opinion. I tend to figure that defining this kind of thing takes a few minutes, so I don't really care whether it's worth it or not. Loading the values out of a file involves specifying a file format that I might regret as soon as I encounter a real reason to customize, so personally I can't be bothered with that until the requirement arises.
The general idea is to separate the portions of your code that will change from those that won't. The more likely something is to change, the more you need to make it easy to change.
If you're building a commercial app where hundreds or thousands of users will use many different motors, it might make sense to code up a UI and store the data in a config file.
If this is development code and these parameters are unlikely to change, then stuffing them into #defines in your constants.h file is probably the way to go.
Number 3 is a great option if you don't have security or IP concerns. Anytime you or someone else touches your code, you introduce the possibility of regressions. By keeping your parameters in a text file, not only are you making life easier on yourself, you're also reducing the scope of possible errors down the road.

More vs Less Functions

I had a little argument, and was wondering what people out there think:
In C++ (or in general), do you prefer code broken up into many shorter functions, with main() consisting of just a list of functions, in a logical order, or do you prefer functions only when necessary (i.e., when they will be reused very many times)? Or perhaps something in between?
Small functions, please
It is the conventional wisdom that smaller functions are better, and I think it's true. In fact, there is a company with an analysis tool that rates individual functions by how many decisions they make compared to the number of unit tests that they have.
The theory is that you may or may not be able to reduce complexity in an entire application, but you have complete control over how much complexity is in any given function.
A measurement called cyclomatic complexity is thought to correlate positively with bad code. Specifically, the more paths there are through a method, the higher its cyclomatic complexity number (CCN), the more poorly it is written, the harder it is to understand and hence to change or even get right to start with, and the more unit tests it will need.
Ok, found the tool. It is called, ahem, the Change Risk Analysis and Predictions index.
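For a feel of how such an index weighs decisions against tests, here is a small sketch. The formula is the one usually quoted for this index, as far as I know: complexity squared times the cube of the uncovered fraction, plus complexity. Treat it as an assumption and check the tool's own documentation before relying on it; crapScore is a hypothetical helper name.

```cpp
// Assumed formula: comp^2 * (1 - coverage/100)^3 + comp.
// complexity: cyclomatic complexity of the method.
// coveragePercent: how much of the method the unit tests cover, 0 to 100.
double crapScore(int complexity, double coveragePercent) {
    const double uncovered = 1.0 - coveragePercent / 100.0;
    return complexity * complexity * uncovered * uncovered * uncovered + complexity;
}
```

Under this formula a method with complexity 10 and no tests scores 110, while the same method fully covered scores 10, which matches the claim that a function with many paths needs many tests to stay manageable.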
Lately, the principle of encoding information only once has grown new acronyms, specifically DRY (Don't Repeat Yourself) and DIE (Duplication is Evil) ...
I believe we can in part thank the RoR community for promoting this philosophy...
Split the functions, but never split functionality.
Functionality may be classified into layers, and each layer may then be split into different functions. For example, when we are computing a sine series, the main loop that adds and subtracts the terms should be in the primary function. This may be considered layer 1. The functionality for computing a power may be classified into layer 2, and can be implemented as a sub-function. Similarly, computing a factorial also belongs to layer 2 and would be another sub-function. Always consider functionality; never count the number of lines. The number of lines may vary from 3 to 300; it doesn't matter. Splitting this way adds readability and maintainability to our code. This is my idea about splitting.
I think the only answer is something in between. If you break up functions every time possible, it becomes unreadable. Likewise, if you never break up, it also becomes unreadable.
I like to group functions into semantic differences. It is one logical unit for some calculation. Keep the units small, but big enough to actually do something useful.
My favorite granularity rule of thumb for a function is "no more than 24 lines of < 80 characters each" -- and that's not just because 80 x 24 terminals were all the rage "back when I started"; I think it's a reasonable guideline for functions you can "grasp as one eyeful", at least in C or languages not much richer than C. "A function does only one thing", AKA "a function has one function" (playing on the meaning of "function" as "role" or "purpose"!-) is a secondary rule I use in languages where "too much functionality" can easily be packed in 24 lines. But the "lexical eyeful" guideline -- 24 x 80 -- is still my main one.
Small functions are good and smaller ones are better.
About five to eight lines of code is my upper limit on function size. Beyond that, and it's too complicated. You should be able to:
Assume that a callee does what its name would indicate,
Read a function's definition in a matter of seconds, and
Convince yourself quickly that the first assumption implies that the present function is correct.
The other thing is that you should use your functions BEFORE you write their code. When you see how you intend to use the function, then you'll see what pre- and post-conditions said functions must respect.
Anything that isn't obviously correct at first glance should be proven correct in the running commentary. If that's difficult, factor sub-functions out.
Whatever helps with code reuse and readability works best, I believe.
Making lots of one-line functions just for the sake of it doesn't help with readability. Group functions into classes that make sense, and then split them up so that you can quickly understand what is going on in each function.
If you have to jump all over to understand what is going on then your design is flawed.
I prefer functions (or methods) which fit within one screenful of code, so I can see at a glance anything I need to reference to understand how that function works. I generally have about 50 lines of space in my editor windows, which are also generally at 80 columns so I can fit two side by side on a monitor and cross reference between two pieces of code.
So, I generally consider 50 lines to be about the maximum. The only time I would consider allowing more is when you have one big long initialization function or something that is completely linear (no variables, conditionals, or loops), since that's not something where you need all that much context and some APIs require a whole bunch of initialization to get up and running, and splitting it into smaller functions wouldn't really help much.
On the whole, though, nice, small, easy to understand functions that do one thing and are well named are vastly preferable to big sprawling monstrosities that are hundreds of lines long, with dozens of variables to keep track of and indentation going 10 levels deep.
Another simple reason: A function should be made when a block of code is being reused more than once or twice. For very small bits of code (say one or two statements), macros often alleviate the problem.

Testing approach for algorithms with complex outputs

How to test a result of a program that is basically a black box? For example one year ago I had to write a B tree as a homework and I really struggled with testing the correctness. What strategies do you use in such scenarios? Visualization? Robust input-->result sets of testing data? What do you do when it is hard to get such data because the only way how to get them is your proper working program?
EDIT: I think that my question was misunderstood. There was no problem with understanding how a B tree works. That is trivial. But writing robust tests for validating its proper functionality is not so trivial. I think that this school problem is similar to many practical real-world scenarios and test cases. And sometimes understanding the domain is quite different from delivering a working and correct program...
EDIT2: And yes, with a B tree it is possible to validate proper behavior with pen and paper. But this is really dirty and not fun :) It does not work well for problems that require a huge amount of data for their validation...
I'm not sure these answers really capture the problem at hand. A B-tree's input and output aren't any different from those of any other dictionary---but the algorithm performs better, if it's implemented correctly. It's only really got two functions to test (add, and find) so theoretically, "black-box" testing of this single component should be fine. Designing for testability isn't the issue, since no matter how you do it the whole algorithm will be one component.
So the question is: when you have to implement subtle algorithms, the kinds with complicated output that you can't always understand in your head so well, how do you test them? I think there are three different strategies you can use:
Black-box test basic functionality. For the B-tree case, this is things like cwash suggested, and also, things like making sure that when you add an item, you can then find it, etc.
Test certain invariants that your algorithm should maintain (the B-tree should be balanced, values within nodes should be sorted, etc.)
A few, small "pencil-and-paper" tests may be necessary -- work the algorithm out by hand and check that it matches what your code does. But the big-data tests can all be of type 2. These can also be brittle, so unless you need to be really sure about your algorithm, you may want to avoid them.
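Strategies 1 and 2 can be combined in a randomized test that needs no hand-computed expected output. In this sketch, SortedBag is a deliberately simple stand-in for the structure under test (a real B-tree would take its place and would add balance checks to invariantsHold), and std::set serves as the trusted reference. All names are hypothetical.

```cpp
#include <algorithm>
#include <functional>
#include <random>
#include <set>
#include <vector>

// Stand-in for the structure under test: a sorted vector without duplicates.
class SortedBag {
public:
    void add(int key) {
        auto pos = std::lower_bound(keys_.begin(), keys_.end(), key);
        if (pos == keys_.end() || *pos != key) keys_.insert(pos, key);
    }
    bool find(int key) const {
        return std::binary_search(keys_.begin(), keys_.end(), key);
    }
    const std::vector<int>& keys() const { return keys_; }
private:
    std::vector<int> keys_;
};

// Strategy 2: an invariant check. Keys must be strictly increasing.
bool invariantsHold(const SortedBag& bag) {
    const std::vector<int>& k = bag.keys();
    return std::adjacent_find(k.begin(), k.end(), std::greater_equal<int>()) == k.end();
}

// Random inserts, checking after each step, then a final comparison
// against std::set as the reference implementation.
bool randomizedCheck(unsigned seed, int operations) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 999);
    SortedBag bag;
    std::set<int> reference;
    for (int i = 0; i < operations; ++i) {
        const int key = dist(rng);
        bag.add(key);
        reference.insert(key);
        if (!bag.find(key)) return false;       // strategy 1: what was added can be found
        if (!invariantsHold(bag)) return false;  // strategy 2: invariants still hold
    }
    return bag.keys().size() == reference.size() &&
           std::equal(reference.begin(), reference.end(), bag.keys().begin());
}
```

Because the seed is fixed, a failing run can be replayed exactly, which makes a large random data set far easier to debug than one assembled by hand.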
If you do not grasp the problem at hand, how can you develop a solution to it? My suggestion would be to understand the domain enough to be able to work out the problem on paper and ensure that your program matches.
Consult with an expert on the subject.
I know if I have a convoluted procedure I'm trying to fix, I have no idea what the output should be after my changes, so I need to consult a fellow developer with more knowledge of the business need, and they are able to verify what I've done is correct.
I would focus on constructing test cases that exercise the functionality of your B-tree algorithm. I haven't looked at it for years, but I'm fairly sure you'll be able to find a documented sequence of steps to insert a set of values in a specific order, then validate that the leaf nodes are as they should be. If you construct your testing along these lines, you should be able to prove your implementation is correct.
The key is to know there is a balance between testing something to death and doing tests that adequately cover what should be covered. Edge cases, e.g. null inputs, or checking that inputs are numeric by testing an alphabetic character or a punctuation character, are likely most of the tests you'd need. To complement these, there may be one or two common cases to show that the program can handle a non-edge case as well. Covering all valid input is overkill for most programs and would result in an overwhelmingly large number of tests.
I think the answer to the question you're asking boils down to designing for testability. Often you get a testable design for free when you test-drive the development of the solution. But let's face it, when you're implementing a highly mathematical algorithm, this just doesn't fall out.
To make sure you have a testable design, you need to understand what a seam is. Then you need to know a few rules of thumb, such as avoiding statics, using polymorphism, and properly decomposing problems and separating concerns.
Watch "The Clean Code Talks -- Unit Testing" by Misko Hevery, I think it will help you wrap your head around it.
Try looking at it from a requirements point of view, rather than an implementation point of view. Before you write code, you must understand exactly what you want it to do.
Testing and requirements should be a matching pair. If you're having trouble defining tests, maybe it's because the requirements are not well-defined. That in turn implies that you may have bugs that aren't so much implementation bugs, but "lack of clear requirements" bugs. The code writer in that case would be working to a mental list of what he/she thinks the requirements are, but can't be sure, and they're not written down for independent understanding and verification.
I've struggled with software where the requirements weren't clear, because the customer couldn't even tell us what they wanted. But when we delivered to them, they sure could tell us then what they didn't like about it! A big part of software engineering is getting the requirements right before the coding begins. This is true on the high-level (overall product, with requirement input from customer) and also the smaller level (modules, individual functions, where requirements are internally defined by software team or individuals). It is still true to some degree I think for iterative development, although the high-level requirements are more fluid.
@Bystrik Jurina,
I often get involved in projects which involve conversions between disparate data formats. Most answers have focused on testing a B-tree or similar algorithm, but it seems that you're looking for a more general answer.
Most of my work is based on the command line. It may sound like a contradiction, but one of the first tools I use is visualization. I'll write some methods to write out my data structures in a format that's easy to consume. This can (and usually does) include something that's visually clear. But often it also means something that I could easily parse with a smaller test program, or even import into Excel.
I'll start by focusing on the basic outline, and write a program that does the bare minimum of what I need to accomplish. If it's a multi-step process, this might mean implementing one step at a time and validating the results of each step before moving on. Or writing something that works only in specific cases, and then expanding the set of cases where it's expected to work. At first you can validate that the code works in the limited set of cases, such as for known input data. As the project moves forward, you can start logging warnings for cases you might not have tested, or for unexpected types of input data. This has drawbacks, but is a nice approach when you're dealing with a known set of input data.
Validation techniques can include formal test cases, or informal programs that work to challenge your assumptions. It could mean writing a basic driver program to exercise the "core" routines. A good example would be to add a record to a database, then read it back and compare the original object against the one loaded from the database.
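The write-then-read-back check can be sketched without a database. Here a one-line text format stands in for the storage layer; Record, save and load are hypothetical names, and the "id|name" format is invented for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical record type.
struct Record {
    int id;
    std::string name;
};

bool operator==(const Record& a, const Record& b) {
    return a.id == b.id && a.name == b.name;
}

// Write a record as "id|name" (stand-in for inserting into a database).
std::string save(const Record& r) {
    return std::to_string(r.id) + "|" + r.name;
}

// Read it back. Only the first '|' is a separator, so names may contain '|'.
// Assumes the line was produced by save().
Record load(const std::string& line) {
    const std::size_t sep = line.find('|');
    return Record{std::stoi(line.substr(0, sep)), line.substr(sep + 1)};
}

// The round-trip property: what comes back equals what went in.
bool roundTrips(const Record& r) {
    return load(save(r)) == r;
}
```

A driver program can then feed roundTrips a batch of awkward records (empty names, separators inside names, negative ids) without anyone having to work out the stored form by hand.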
If you have trouble wrapping your head around the way a program functions, think about what it needs to accomplish. It might be easier to write code that tests the way different inputs produce different outputs. Producing visualizations is a good help, because the act of deciding how to display the data can make you think about different conditions and focus in on the most critical parts of your data structures.
Often I've found that building a visualization brings me to admit that the way the data is being stored just isn't very clear. For a B-tree, the representation isn't very flexible. But for other cases, you may be using parallel arrays when a nested tree of objects would be more natural.

Need refactoring ideas for Arrow Anti-Pattern

I have inherited a monster.
It is masquerading as a .NET 1.1 application that processes text files conforming to Healthcare Claim Payment (ANSI 835) standards, but it's a monster. The information being processed relates to healthcare claims, EOBs, and reimbursements. These files consist of records that have an identifier in the first few positions and data fields formatted according to the specs for that type of record. Some record ids are Control Segment ids, which delimit groups of records relating to a particular type of transaction.
To process a file, my little monster reads the first record, determines the kind of transaction that is about to take place, then begins to process other records based on what kind of transaction it is currently processing. To do this, it uses a nested if. Since there are a number of record types, there are a number of decisions that need to be made. Each decision involves some processing and 2-3 other decisions that need to be made based on previous decisions. That means the nested if has a lot of nests. That's where my problem lies.
This one nested if is 715 lines long. Yes, that's right. Seven-Hundred-And-Fif-Teen Lines. I'm no code analysis expert, so I downloaded a couple of freeware analysis tools and came up with a McCabe Cyclomatic Complexity rating of 49. They tell me that's a pretty high number. High as in pollen count in the Atlanta area where 100 is the standard for high and the news says "Today's pollen count is 1,523". This is one of the finest examples of the Arrow Anti-Pattern I have ever been privileged to see. At its highest, the indentation goes 15 tabs deep.
My question is, what methods would you suggest to refactor or restructure such a thing?
I have spent some time searching for ideas, but nothing has given me a good foothold. For example, substituting a guard condition for a level is one method. I have only one of those. One nest down, fourteen to go.
Perhaps there is a design pattern that could be helpful. Would Chain of Command be a way to approach this? Keep in mind that it must stay in .NET 1.1.
Thanks for any and all ideas.
I just had some legacy code at work this week that was similar to (although not as dire as) what you are describing.
There is no one thing that will get you out of this. The state machine might be the final form your code takes, but that's not going to help you get there, nor should you decide on such a solution before untangling the mess you already have.
The first step I would take is to write a test for the existing code. This test isn't to show that the code is correct but to make sure you have not broken something when you start refactoring. Get a big wad of data to process, feed it to the monster, and get the output. That's your litmus test. If you can do this with a code coverage tool, you will see what your test does not cover. If you can, construct some artificial records that will also exercise this code, and repeat. Once you feel you have done what you can with this task, the output data becomes your expected result for your test.
Refactoring should not change the behavior of the code. Remember that. This is why you have known input and known output data sets to validate you are not going to break things. This is your safety net.
Now Refactor!
A couple of things I did that I found useful:
Invert if statements
A huge problem I had was just reading the code when I couldn't find the corresponding else statement. I noticed that a lot of the blocks looked like this:
if (someCondition) {
    100+ lines of code
} else {
    simple statement here
}
By inverting the if I could see the simple case first and then move on to the more complex block knowing what the other one already did. Not a huge change, but it helped me in understanding.
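A tiny before-and-after sketch of the inversion. The functions are made up for the illustration (the original code is .NET; C++ is used here only to show the shape), and the comment marks where the long block would sit.

```cpp
#include <string>

// Before: the short else branch sits far away from the condition it belongs to.
std::string describeBefore(bool hasPayload, int size) {
    std::string result;
    if (hasPayload) {
        // ... imagine 100+ lines of processing here ...
        result = "payload of " + std::to_string(size) + " bytes";
    } else {
        result = "empty";
    }
    return result;
}

// After: the condition is inverted, so the simple case is handled first
// and the long block no longer needs an else or an extra indent level.
std::string describeAfter(bool hasPayload, int size) {
    if (!hasPayload) {
        return "empty";
    }
    // ... the 100+ lines now follow at the outer indentation level ...
    return "payload of " + std::to_string(size) + " bytes";
}
```

Both versions return the same results, which is exactly what the regression test from the earlier step should confirm after each such change.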
Extract Method
I used this a lot. Take some complex multi-line block, grok it, and shove it aside in its own method. This allowed me to more easily see where there was code duplication.
Now, hopefully, you haven't broken your code (test still passes, right?), and you have more readable and better understood procedural code. Look, it's already improved! But that test you wrote earlier isn't really good enough... it only tells you that you are duplicating the functionality (bugs and all) of the original code, and that's only for the lines you had coverage on, as I'm sure you will find blocks of code that you can't figure out how to hit or just cannot ever hit (I've seen both in my work).
Now the big changes where all the big name patterns come into play is when you start looking at how you can refactor this in a proper OO fashion. There is more than one way to skin this cat, and it will involve multiple patterns. Not knowing details about the format of these files you're parsing I can only toss around some helpful suggestions that may or may not be the best solutions.
Refactoring to Patterns is a great book that explains the patterns that are helpful in these situations.
You're trying to eat an elephant, and there's no other way to do it but one bite at a time. Good luck.
A state machine seems like the logical place to start, and using WF if you can swing it (sounds like you can't).
You can still implement one without WF; you just have to do it yourself. However, thinking of it as a state machine from the start will probably give you a better implementation than creating a procedural monster that checks internal state on every action.
Diagram out your states and what causes each transition. The actual code to process a record should be factored out, and called when the state executes (if that particular state requires it).
So State1's execute calls your "read a record", then based on that record transitions to another state.
The next state may read multiple records and call record processing instructions, then transition back to State1.
One thing I do in these cases is to use the 'Composed Method' pattern. See Jeremy Miller's Blog Post on this subject. The basic idea is to use the refactoring tools in your IDE to extract small meaningful methods. Once you've done that, you may be able to further refactor and extract meaningful classes.
I would start with uninhibited use of Extract Method. If you don't have it in your current Visual Studio IDE, you can either get a 3rd-party addin, or load your project in a newer VS. (It'll try to upgrade your project, but you will carefully ignore those changes instead of checking them in.)
You said that you have code indented 15 levels. Start about 1/2-way out, and Extract Method. If you can come up with a good name, use it, but if you can't, extract anyway. Split in half again. You're not going for the ideal structure here; you're trying to break the code in to pieces that will fit in your brain. My brain is not very big, so I'd keep breaking & breaking until it doesn't hurt any more.
As you go, look for any new long methods that seem to be different than the rest; make these in to new classes. Just use a simple class that has only one method for now. Heck, making the method static is fine. Not because you think they're good classes, but because you are so desperate for some organization.
Check in often as you go, so you can checkpoint your work, understand the history later, be ready to do some "real work" without needing to merge, and save your teammates the hassle of hard merging.
Eventually you'll need to go back and make sure the method names are good, that the set of methods you've created make sense, clean up the new classes, etc.
If you have a highly reliable Extract Method tool, you can get away without good automated tests. (I'd trust VS in this, for example.) Otherwise, make sure you're not breaking things, or you'll end up worse than you started: with a program that doesn't work at all.
A pairing partner would be helpful here.
Judging by the description, a state machine might be the best way to deal with it. Have an enum variable to store the current state, and implement the processing as a loop over the records, with a switch or if statements to select the action to take based on the current state and the input data. You can also easily dispatch the work to separate functions based on the state using function pointers, too, if it's getting too bulky.
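A minimal sketch of that shape: an enum for the current state, a loop over the records, and a switch that picks the action. The record ids "BEGIN", "CLAIM", "LINE" and "END" are invented for the example and are not real 835 segment ids; the Summary counters stand in for the actual record processing. C++ is used only for illustration, but the same structure carries over to C#.

```cpp
#include <string>
#include <vector>

enum class ParseState { ExpectHeader, InTransaction, Error };

// Stand-in for whatever the real processing would accumulate.
struct Summary {
    int claims = 0;
    int lines = 0;
};

// One transition: given the current state and one record, do the work
// for that record and return the next state.
ParseState step(ParseState state, const std::string& record, Summary& out) {
    switch (state) {
    case ParseState::ExpectHeader:
        return record == "BEGIN" ? ParseState::InTransaction : ParseState::Error;
    case ParseState::InTransaction:
        if (record == "CLAIM") { ++out.claims; return ParseState::InTransaction; }
        if (record == "LINE")  { ++out.lines;  return ParseState::InTransaction; }
        if (record == "END")   { return ParseState::ExpectHeader; }
        return ParseState::Error;
    case ParseState::Error:
        return ParseState::Error;
    }
    return ParseState::Error;
}

// The loop over the records. Returns true if the file was well formed,
// i.e. every transaction that was opened was also closed.
bool processFile(const std::vector<std::string>& records, Summary& out) {
    ParseState state = ParseState::ExpectHeader;
    for (const std::string& record : records) {
        state = step(state, record, out);
        if (state == ParseState::Error) return false;
    }
    return state == ParseState::ExpectHeader;
}
```

Each case in the switch is flat, so adding a record type means adding one branch to one state rather than another level of nesting.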
There was a pretty good blog post about it at Coding Horror. I've only come across this anti-pattern once, and I pretty much just followed his steps.
Sometimes I combine the state pattern with a stack.
It works well for hierarchical structures; a parent element knows what state to push onto the stack to handle a child element, but a child doesn't have to know anything about its parent. In other words, the child doesn't know what the next state is, it simply signals that it is "complete" and gets popped off the stack. This helps to decouple the states from each other by keeping dependencies uni-directional.
It works great for processing XML with a SAX parser (the content handler just pushes and pops states to change its behavior as elements are entered and exited). EDI should lend itself to this approach too.
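A rough sketch of the stack idea, under invented assumptions: the input is a flat list of "open" and "close" events, and the hierarchy is file, then transaction, then claim. The parent state decides what to push for a child; the child only signals completion and is popped.

```cpp
#include <stack>
#include <string>
#include <vector>

enum class Level { File, Transaction, Claim };

// Returns the number of claims seen, or -1 if the events are malformed.
int countClaims(const std::vector<std::string>& events) {
    std::stack<Level> states;
    states.push(Level::File);  // the root state is never popped
    int claims = 0;
    for (const std::string& event : events) {
        const Level current = states.top();
        if (event == "close") {
            if (current == Level::File) return -1;  // nothing left to close
            states.pop();  // the child is complete; the parent resumes
        } else if (event == "open") {
            // The parent decides which state handles its child.
            if (current == Level::File) {
                states.push(Level::Transaction);
            } else if (current == Level::Transaction) {
                states.push(Level::Claim);
                ++claims;
            } else {
                return -1;  // in this sketch, claims have no children
            }
        } else {
            return -1;  // unknown event
        }
    }
    return states.size() == 1 ? claims : -1;  // everything opened must be closed
}
```

Note that no branch for Claim needs to know it lives inside a Transaction: it is popped on "close" and whatever state is underneath takes over.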