Should Test Code be Separate from Source/Production Code - unit-testing

Even if we have a Makefile or something similar to separate out the test code when shipping the product, should the tests live apart from the production code?
In my opinion, they should be separate, but I am not entirely convinced as to WHY.

Yes, they should be separate (folders and preferably projects). Some reasons:
GREP. Searching for a string in production source is easier.
Code coverage. Imagine trying to specify which files to include for coverage.
Different standards. You may want to run static analysis, etc. only on production code.
Simplified makefiles/build scripts.
Modern IDEs will allow you to work on code from separate projects/folders as if they were adjacent.
The worst thing you can do is to include test and production code in the same file (with conditional compilation, different entry points, etc.). Not only can this confuse developers trying to read the code, you always run the risk of accidentally shipping test code.

Since I had a chance to work with both approaches (tests separated and tests mixed in with the project code), here are a few small things that got in the way and are worth noting (C#, Visual Studio, MSBuild).
Same project approach
References/external library dependencies: unit testing and mocking frameworks usually come with a few dependencies of their own; combine that with the libraries you need for the actual project and the list grows very quickly (and nobody likes huge lists, right?).
Naming collisions: with a class named MyClass, the common approach is to name the test class MyClassTest - this causes tiny annoyances when using navigation/name-completion tools (since there's a bigger chance you'll have more than one result to choose from for quick navigation).
An overall feeling of ubiquitous mess.
Naming collisions can actually get even more tiresome, considering how classes relating to similar functionality usually share a prefix (e.g. ListManager ...Converter, Formatter, Provider). Navigating between a comprehensible number of items (usually 3-7) is not a problem - enter tests, and you get long lists again.
Separated approach
Project count: the number of libraries you produce effectively doubles - once for the project code alone, and again for the tests. When bigger products are involved (200-300+ sub-projects/libraries), having that number doubled by test projects extends IDE startup time in a way you never want to experience.
Of course, modern machines will mitigate the project-count issue. Unless it really becomes a problem, I'd always go for the separated-projects approach - it's just neater, cleaner and far easier to manage.

In a fully automated build and unit-testing setup, you can essentially separate them out.
It makes more sense to fire up the automated unit tests after the nightly build is done.
Keeping them separate makes it easier for maintenance purposes.

Which is better? Unit-test project per solution or per project?

Is it better to have a unit-test project per solution or a unit-test project per project?
With per solution, if you have 5 projects in the solution you end up with 1 unit-test project containing tests for each of the 5 projects.
With per project, if you have 5 projects in the solution you end up with 5 unit-test projects.
What is the right way?
I think it's not the same question as Write Unit tests into an assembly or in a separate assembly?
Assemblies are a packaging/deployment concern, so we usually split them out because we don't want to deploy them with our product. Whether you split them out per library or per solution there are merits to both.
Ultimately, you want tests to be immediately available to all developers, so that developers know where to find them when needed. You also want an obstacle free environment with minimal overhead to writing new tests so that you aren't arming the cynics who don't want to write tests. Tests must also compile and execute quickly - project structure can play a part in all of this.
You may also want to consider that different levels of testing are possible, such as tests for unit, integration or UI automation. Segregating these types of tests is possible in some tools by using test categories, but sometimes it's easier for execution or reporting if they are separate libraries. 
If you have special packaging considerations such as a modular application where modules should not be aware of one another, your test projects should also reflect this.
In small solutions where there aren't a lot of projects, a 1:1 ratio is usually the preferred approach. However, Visual Studio performance quickly degrades as the number of projects increases. Around the 40-project mark, compilation becomes an obstacle to building and running the tests, so larger solutions may benefit from consolidating test projects.
I tend to prefer a pragmatic approach so that complexity is appropriate to the problem.  Typically, an application will be comprised of several layers where each layer may have multiple projects.  I like to start with a single test library per layer and I mimic the solution structure using folders. Divide when complexity warrants it. If you design your test projects for flexibility then changeover is usually painless.
I would say a separate unit-test project for each project, rather than one test project per solution. I think this is better because it will save you a lot of hassle if you decide to take a particular project out of a solution and move it to another solution.
We have a one-to-one ratio of test projects to projects in a very large system, and we have reached a point where a build takes more than 90 minutes every time we check in. I have created a new solution configuration which builds only the test projects; that configuration runs once a day to make sure all test cases are working, and developers can switch to the unit-test configuration to test their code in their development environment. Please let me know your feedback.
I am going with one of those "it depends" answers. Personally, I tend to put all of the tests in a single project, with separate folders within the project for each assembly (plus further sub-folders as necessary). This makes it easy to run the entire set, either from within Visual Studio or CruiseControl.NET.
If you have thousands of tests, a single project might prove too difficult to maintain. Also, as Peter Kelly mentioned in his response, being able to easily split the tests if you move a project can be useful.
Personally I write one test assembly per project. I then create one nunit project per solution and reference all of the relevant test assemblies from that. This reflects the organisation of the projects and means that if projects are reused in different solutions then only the relevant unit tests need to be run.

What is the best approach for coding in a slow compilation environment?

I used to code in C# in a TDD style - write or change a small chunk of code, re-compile the whole solution in 10 seconds, re-run the tests, and go again. Easy...
That development methodology worked very well for me for a few years, until last year, when I had to go back to C++ coding, and it really feels as though my productivity has dramatically decreased since. C++ as a language is not the problem - I have quite a lot of C++ dev experience... but in the past.
My productivity is still OK for small projects, but it gets worse as the project size increases, and once compilation time hits 10+ minutes it gets really bad. And if I find an error, I have to start compilation again, etc. That is just purely frustrating.
Thus I concluded that working in small chunks (as before) is not acceptable - any recommendations on how I can get back into the long-gone habit of coding for an hour or so, reviewing the code manually (without relying on a fast C# compiler), and only recompiling/re-running unit tests once every couple of hours?
With C# and TDD it was very easy to write code in an evolutionary way - after a dozen iterations, whatever crap I started with ended up as good code - but that just does not work for me anymore (in a slow-compilation environment).
Would really appreciate your input and recommendations.
p.s. not sure how to tag the question - anyone is welcome to re-tag the question appropriately.
Cheers.
I've found that recompiling and testing sort of pulls me out of the "zone", so in order to have the benefits of TDD, I commit fairly often into a git repository, and run a background process that checks out any new commit, runs the full test suite and annotates the commit object in git with the result. When I get around to it (usually in the evening), I then go back to the test results, fix any issues and "rewrite history", then re-run the tests on the new history. This way I don't have to interrupt my work even for the short times it takes to recompile (most of) my projects.
Sometimes you can avoid the long compile. Aside from improving the quality of your build files/process, you may be able to pick just a small thing to build. If the file you're working on is a .cpp file, just compile that one TU and unit-test it in isolation from the rest of the project. If it's a header (perhaps containing inline functions and templates), do the same with a small number of TUs that between them reference most of the functionality (if no such set of TUs exists, write unit tests for the header file and use those). This lets you quickly detect obvious stupid errors (like typos) that don't compile, and runs the subset of tests you believe to be relevant to the changes you're making. Once you have something that might vaguely work, do a proper build/test of the project to ensure you haven't broken anything you didn't realise was relevant.
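For example, here is a minimal sketch of testing a single TU in isolation; the names (widget.h, Widget, setLabel, label) are hypothetical and stand in for whatever the file under test actually provides:
// test_widget.cpp -- hypothetical standalone test for one translation unit
// Build only the two TUs involved, e.g.: g++ -I. widget.cpp test_widget.cpp -o test_widget
#include <cassert>
#include <string>
#include "widget.h"   // header of the .cpp under test (assumed)

int main()
{
    Widget w;
    w.setLabel("hello");
    assert(w.label() == std::string("hello"));   // catches the obvious, stupid errors quickly
    return 0;
}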
Where a long compile/test cycle is unavoidable, I work on two things at once. For this to be efficient, one of them needs to be simple enough that it can just be dropped when the main task is ready to be resumed, and picked up again immediately when the main task's compile/test cycle is finished. This takes a bit of planning. And of course the secondary task has its own build/test cycle, so sometimes you want to work in separate checked-out copies of the source so that errors in one don't block the other.
The secondary task could for example be, "speed up the partial compilation time of the main task by reducing inter-component dependencies". Even so you may have hit a hard limit once it's taking 10 minutes just to link your program's executable, since splitting the thing into multiple dlls just as a development hack probably isn't a good idea. The key thing to avoid is for the secondary task to be, "hit SO", or this.
Since a simple change triggers a 10-minute recompilation, that means you have a bad build system. Your build should recompile only the changed files and the files that depend on them.
Other than that, there are other techniques to speed up the build time (for example, try to remove unneeded includes, and instead of including a header, use a forward declaration, etc.), but the speed-up from these matters less than making sure only what changed gets recompiled.
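For instance, here is a minimal sketch of the forward-declaration idea (file and class names are hypothetical): a header that only passes a type by reference or pointer can forward-declare it instead of including its header, so edits to that type no longer ripple through every includer.
// report.h -- hypothetical example
#ifndef REPORT_H
#define REPORT_H

class Printer;                       // forward declaration; no #include "printer.h" needed here

class Report
{
public:
    void sendTo(Printer& printer);   // references/pointers only need the declaration
};

#endif

// report.cpp -- the full definition is pulled in only where it is actually used
#include "report.h"
#include "printer.h"                 // assumed to declare class Printer with a print() member

void Report::sendTo(Printer& printer)
{
    printer.print("report body");
}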
I don't see why you can't use TDD with C++. I used CppUnit back in 2001, so I assume it's still in place.
You don't say what IDE or build tool you're using, so I can't comment on how those affect your pace. But small, incremental compiles and running unit tests are both still possible.
Perhaps looking into Cruise Control, Team City, or another hands-off build and test process would be your cup of tea. You can just check in as fast as you can and let the automated build happen on another server.

What to do about a 11000 lines C++ source file?

So we have this huge (is 11000 lines huge?) mainmodule.cpp source file in our project and every time I have to touch it I cringe.
As this file is so central and large, it keeps accumulating more and more code and I can't think of a good way to make it actually start to shrink.
The file is used and actively changed in several (> 10) maintenance versions of our product and so it is really hard to refactor it. If I were to "simply" split it up, say for a start, into 3 files, then merging back changes from maintenance versions will become a nightmare. And also if you split up a file with such a long and rich history, tracking and checking old changes in the SCC history suddenly becomes a lot harder.
The file basically contains the "main class" (main internal work dispatching and coordination) of our program, so every time a feature is added, it also affects this file and every time it grows. :-(
What would you do in this situation? Any ideas on how to move new features to a separate source file without messing up the SCC workflow?
(Note on the tools: We use C++ with Visual Studio; We use AccuRev as SCC but I think the type of SCC doesn't really matter here; We use Araxis Merge to do actual comparison and merging of files)
Merging will not be as big a nightmare now as it will be when you have a 30,000-LOC file in the future. So:
Stop adding more code to that file.
Split it.
If you can't just stop coding during the refactoring process, you could leave this big file as is for a while, at least without adding more code to it: since it contains one "main class", you could inherit from it and keep the derived class(es), with overridden functions, in several new, small and well-designed files.
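A minimal sketch of that idea with hypothetical names, assuming the relevant functions are (or can be made) virtual:
// main_class_scheduling.h -- new, small, well-designed file (hypothetical names)
#ifndef MAIN_CLASS_SCHEDULING_H
#define MAIN_CLASS_SCHEDULING_H

#include "mainmodule.h"   // header of the existing "main class" (assumed)

// New or changed behaviour goes into a derived class instead of growing mainmodule.cpp.
class SchedulingMainClass : public MainClass
{
public:
    virtual void dispatchWork(const WorkItem& item);   // overrides an assumed virtual; defined in the new .cpp
};

#endif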
Find some code in the file which is relatively stable (not changing fast, and doesn't vary much between branches) and could stand as an independent unit. Move this into its own file, and for that matter into its own class, in all branches. Because it's stable, this won't cause (many) "awkward" merges that have to be applied to a different file from the one they were originally made on, when you merge the change from one branch to another. Repeat.
Find some code in the file which basically only applies to a small number of branches, and could stand alone. Doesn't matter whether it's changing fast or not, because of the small number of branches. Move this into its own classes and files. Repeat.
So, we've got rid of the code that's the same everywhere, and the code that's specific to certain branches.
This leaves you with a nucleus of badly-managed code - it's needed everywhere, but it's different in every branch (and/or it changes constantly so that some branches are running behind others), and yet it's in a single file that you're unsuccessfully trying to merge between branches. Stop doing that. Branch the file permanently, perhaps by renaming it in each branch. It's not "main" any more, it's "main for configuration X". OK, so you lose the ability to apply the same change to multiple branches by merging, but this is in any case the core of code where merging doesn't work very well. If you're having to manually manage the merges anyway to deal with conflicts, then it's no loss to manually apply them independently on each branch.
I think you're wrong to say that the kind of SCC doesn't matter, because for example git's merging abilities are probably better than the merge tool you're using. So the core problem, "merging is difficult" occurs at different times for different SCCs. However, you're unlikely to be able to change SCCs, so the issue is probably irrelevant.
It sounds to me like you're facing a number of code smells here. First of all the main class appears to violate the open/closed principle. It also sounds like it is handling too many responsibilities. Due to this I would assume the code to be more brittle than it needs to be.
While I can understand your concerns regarding traceability following a refactoring, I would expect that this class is rather hard to maintain and enhance and that any changes you do make are likely to cause side effects. I would assume that the cost of these outweighs the cost of refactoring the class.
In any case, since the code smells will only get worse with time, at least at some point the cost of these will outweigh the cost of refactoring. From your description I would assume that you're past the tipping point.
Refactoring this should be done in small steps. If possible add automated tests to verify current behavior before refactoring anything. Then pick out small areas of isolated functionality and extract these as types in order to delegate the responsibility.
In any case, it sounds like a major project, so good luck :)
The only solution I have ever imagined for such problems follows. The actual gain of the described method is the progressiveness of the evolution. No revolutions here, otherwise you'll be in trouble very fast.
Insert a new C++ class above the original main class. For now, it would basically redirect all calls to the current main class, but aim at making the API of this new class as clear and succinct as possible.
Once this has been done, you get the possibility to add new functionalities in new classes.
As for existing functionality, you have to progressively move it into new classes as it becomes stable enough. You will lose SCC help for this piece of code, but there is not much that can be done about that. Just pick the right timing.
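A minimal sketch of such a forwarding class, with hypothetical names; the point is that new callers only ever see the small, clean API while the legacy class shrinks behind it:
// work_coordinator.h -- the new, succinct API layered above the legacy main class (hypothetical names)
#ifndef WORK_COORDINATOR_H
#define WORK_COORDINATOR_H

#include "mainmodule.h"   // header of the legacy 11 KLOC class (assumed)

class WorkCoordinator
{
public:
    explicit WorkCoordinator(MainClass& legacy) : legacy_(legacy) {}

    // For now these just redirect to assumed members of the legacy class;
    // their implementations migrate into new classes as they become stable.
    void startJob(int jobId)  { legacy_.startJob(jobId); }
    void cancelJob(int jobId) { legacy_.cancelJob(jobId); }

private:
    MainClass& legacy_;
};

#endif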
I know this is not perfect, though I hope it can help, and the process must be adapted to your needs!
Additional information
Note that Git is an SCC that can follow pieces of code from one file to another. I have heard good things about it, so it could help while you are progressively moving your work.
Git is constructed around the notion of blobs which, if I understand correctly, represent pieces of code files. Move these pieces around in different files and Git will find them, even if you modify them. Apart from the video from Linus Torvalds mentioned in comments below, I have not been able to find something clear about this.
Confucius say: "first step to getting out of hole is to stop digging hole."
Let me guess: Ten clients with divergent feature sets and a sales manager that promotes "customization"? I've worked on products like that before. We had essentially the same problem.
You recognize that having an enormous file is trouble, but even more trouble is ten versions that you have to keep "current". That's multiple maintenance. SCC can make that easier, but it can't make it right.
Before you try to break the file into parts, you need to bring the ten branches back in sync with each other so that you can see and shape all the code at once. You can do this one branch at a time, testing both branches against the same main code file. To enforce the custom behavior, you can use #ifdef and friends, but it's better, as much as possible, to use ordinary if/else against defined constants. This way, your compiler will verify all types and most probably eliminate "dead" object code anyway. (You may want to turn off the warning about dead code, though.)
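A minimal sketch of that suggestion, with hypothetical constants and functions; the plain if keeps both branches compiled and type-checked, and the optimizer removes the dead one:
// build_config.h -- one tiny per-branch file (hypothetical); false in the other branches
const bool kCustomerXBilling = true;

// billing.cpp -- branch-specific behaviour behind an ordinary if instead of #ifdef
#include "build_config.h"

struct Invoice { double total; };

static void applyStandardRate(Invoice& inv)       { inv.total *= 1.10; }
static void applyCustomerXSurcharge(Invoice& inv) { inv.total *= 1.25; }

void processInvoice(Invoice& invoice)
{
    if (kCustomerXBilling)
        applyCustomerXSurcharge(invoice);   // both sides are always compiled and checked
    else
        applyStandardRate(invoice);
}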
Once there's only one version of that file shared implicitly by all branches, then it's rather easier to begin traditional refactoring methods.
The #ifdefs are primarily better for sections where the affected code only makes sense in the context of other per-branch customizations. One may argue that these also present an opportunity for the same branch-merging scheme, but don't go hog-wild. One colossal project at a time, please.
In the short run, the file will appear to grow. This is OK. What you're doing is bringing things together that need to be together. Afterwards, you'll begin to see areas that are clearly the same regardless of version; these can be left alone or refactored at will. Other areas will clearly differ depending on the version. You have a number of options in this case. One method is to delegate the differences to per-version strategy objects. Another is to derive client versions from a common abstract class. But none of these transformations are possible as long as you have ten "tips" of development in different branches.
I don't know if this solves your problem, but what I guess you want to do is migrate the content of the file into smaller files that are independent of each other (to sum it up).
What I also get is that you have about 10 different versions of the software floating around and you need to support them all without messing things up.
First of all, there is just no way that this is easy or will solve itself in a few minutes of brainstorming. The functions in your file are all vital to your application, and simply cutting them out and migrating them to other files won't solve your problem.
I think you only have these options:
Don't migrate and stay with what you have. Possibly quit your job and start working on serious software with good design in addition. Extreme programming is not always the best solution if you are working on a long time project with enough funds to survive a crash or two.
Work out a layout of how you would love your file to look once it's split up. Create the necessary files and integrate them in your application. Rename the functions or overload them to take an additional parameter (maybe just a simple boolean?).
Once you have to work on your code, migrate the functions you need to work on to the new file and map the function calls of the old functions to the new functions.
You should still have your main file this way, and still be able to see the changes that were made to it; when it comes to a specific function, you know exactly when it was outsourced, and so on.
Try to convince your co-workers with some good cake that workflow is overrated and that you need to rewrite some parts of the application in order to do serious business.
Exactly this problem is handled in one of the chapters of the book "Working Effectively with Legacy Code" (http://www.amazon.com/Working-Effectively-Legacy-Michael-Feathers/dp/0131177052).
I think you would be best off creating a set of command classes that map to the API points of the mainmodule.cpp.
Once they are in place, you will need to refactor the existing code base to access these API points via the command classes; once that's done, you are free to refactor each command's implementation into a new class structure.
Of course, with a single class of 11 KLOC the code in there is probably highly coupled and brittle, but creating individual command classes will help much more than any other proxy/facade strategy.
I don't envy the task, but as time goes on this problem will only get worse if it's not tackled.
Update
I'd suggest that the Command pattern is preferable to a Facade.
Maintaining/organizing a lot of different Command classes is preferable to a (relatively) monolithic Facade. A single Facade mapped onto an 11 KLOC file would probably need to be broken up into a few different groups itself.
Why bother trying to figure out these facade groups? With the Command pattern you will be able to group and organise these small classes organically, so you have a lot more flexibility.
Of course, both options are better than the single 11 KLOC (and growing) file.
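A minimal sketch of one such command class, with hypothetical names; each command wraps one API point of the legacy module so that its implementation can later be refactored behind a stable interface:
// commands.h -- hypothetical names
#ifndef COMMANDS_H
#define COMMANDS_H

#include "mainmodule.h"   // header of the legacy main class (assumed)

class Command
{
public:
    virtual ~Command() {}
    virtual void execute() = 0;
};

// Wraps a single API point of the legacy class.
class ReloadConfigCommand : public Command
{
public:
    explicit ReloadConfigCommand(MainClass& legacy) : legacy_(legacy) {}
    virtual void execute() { legacy_.reloadConfig(); }   // assumed legacy member function

private:
    MainClass& legacy_;
};

#endif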
One important piece of advice: do not mix refactoring and bug fixes. What you want is a version of your program that is identical to the previous version, except that the source code is organized differently.
One way could be to start splitting the smallest function/part into its own file and then including it with a header (thus turning main.cpp into a list of #includes, which sounds like a code smell in itself - I'm not a C++ guru, though - but at least it's now split into files).
You could then try to switch all maintenance releases over to the "new" main.cpp or whatever your structure is. Again: no other changes or bug fixes, because tracking those is confusing as hell.
Another thing: As much as you may desire making one big pass at refactoring the whole thing in one go, you might bite off more than you can chew. Maybe just pick one or two "parts", get them into all the releases, then add some more value for your customer (after all, Refactoring does not add direct value so it is a cost that has to be justified) and then pick another one or two parts.
Obviously that requires some discipline in the team to actually use the split files and not just add new stuff to the main.cpp all the time, but again, trying to do one massive refactor may not be the best course of action.
Rofl, this reminds me of my old job. It seems that, before I joined, everything was inside one huge file (also C++). Then they split it up (at completely random points, using includes) into about three (still huge) files. The quality of this software was, as you might expect, horrible. The project totaled about 40k LOC (containing almost no comments but LOTS of duplicate code).
In the end I did a complete rewrite of the project. I started by redoing the worst part of the project from scratch. Of course I had in mind a possible (small) interface between this new part and the rest. Then I inserted this part into the old project. I didn't refactor the old code to create the necessary interface, but simply replaced it. Then I made small steps from there, rewriting the old code.
I have to say that this took about half a year and there was no development of the old code base beside bugfixes during that time.
edit:
The size stayed at about 40k LOC, but the new application contained many more features and presumably fewer bugs in its initial version than the 8-year-old software. One reason for the rewrite was also that we needed the new features, and introducing them into the old code was nearly impossible.
The software was for an embedded system, a label printer.
Another point that I should add is that in theory the project was C++. But it wasn't OO at all, it could have been C. The new version was object oriented.
OK, so for the most part, rewriting the API of production code is a bad idea as a start. Two things need to happen.
One, you need to actually have your team decide to do a code freeze on current production version of this file.
Two, you need to take this production version and create a branch that manages the builds using preprocessing directives to split up the big file. Splitting the compilation using JUST preprocessor directives (#ifdefs, #includes, #endifs) is easier than recoding the API. It's definitely easier for your SLAs and ongoing support.
Here you could simply cut out functions that relate to a particular subsystem within the class and put them in a file say mainloop_foostuff.cpp and include it in mainloop.cpp at the right location.
OR
A more time consuming but robust way would be to devise an internal dependencies structure with double-indirection in how things get included. This will allow you to split things up and still take care of co-dependencies. Note that this approach requires positional coding and therefore should be coupled with appropriate comments.
This approach would include components that get used based on which variant you are compiling.
The basic structure is that your mainclass.cpp will include a new file called MainClassComponents.cpp after a block of statements like the following:
#if VARIANT == 1
# define Uses_Component_1
# define Uses_Component_2
#elif VARIANT == 2
# define Uses_Component_1
# define Uses_Component_3
# define Uses_Component_6
...
#endif
#include "MainClassComponents.cpp"
The primary purpose of the MainClassComponents.cpp file would be to work out the dependencies among the sub-components, like this:
#ifndef _MainClassComponents_cpp
#define _MainClassComponents_cpp
/* dependencies declarations */
#if defined(Uses_Component_1)
#define _REQUIRES_COMPONENT_1
#define _REQUIRES_COMPONENT_3 /* you also need component 3 for component 1 */
#endif
#if defined(Uses_Component_2)
#define _REQUIRES_COMPONENT_2
#define _REQUIRES_COMPONENT_15 /* you also need component 15 for this component */
#endif
/* later on in the header */
#ifdef _REQUIRES_COMPONENT_1
#include "component_1.cpp"
#endif
#ifdef _REQUIRES_COMPONENT_2
#include "component_2.cpp"
#endif
#ifdef _REQUIRES_COMPONENT_3
#include "component_3.cpp"
#endif
#endif /* _MainClassComponents_cpp */
And now for each component you create a component_xx.cpp file.
Of course I am using numbers here, but you should use something more logical based on your code.
Using the preprocessor allows you to split things up without having to worry about API changes, which are a nightmare in production.
Once you have production settled you can then actually work on redesign.
Well, I understand your pain :) I've been on a few such projects as well, and it's not pretty. There is no easy answer for this.
One approach that may work for you is to start adding safeguards in all functions - that is, checking arguments and pre/post-conditions in methods - and then eventually adding unit tests, all in order to capture the current functionality of the sources. Once you have this, you are better equipped to refactor the code, because asserts and errors will pop up to alert you if you have forgotten something.
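A minimal sketch of such safeguards, with a hypothetical function; plain asserts are used here, though a real codebase might have its own checking macros:
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical legacy function, now guarded with pre/post-conditions that
// document and enforce the behaviour we currently rely on.
double averageLatency(const std::vector<double>& samples)
{
    assert(!samples.empty());               // pre-condition: caller must supply data

    double sum = 0.0;
    for (std::size_t i = 0; i < samples.size(); ++i)
    {
        assert(samples[i] >= 0.0);          // pre-condition: latencies are non-negative
        sum += samples[i];
    }

    double result = sum / samples.size();
    assert(result >= 0.0);                  // post-condition: the average is non-negative
    return result;
}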
Sometimes, though, refactoring may bring more pain than benefit. Then it may be better to just leave the original project in a pseudo-maintenance state and start from scratch, incrementally adding the functionality from the beast.
You should not be concerned with reducing the file size, but rather with reducing the class size. It comes down to almost the same thing, but makes you look at the problem from a different angle (as @Brian Rasmussen suggests, your class seems to have too many responsibilities).
What you have is a classic example of a known design antipattern called the Blob. Take some time to read the article I point to here, and you may find something useful in it. Besides, if this project is as big as it looks, you should consider some design work to prevent it from growing into code that you can't control.
This isn't an answer to the big problem, but a theoretical solution to a specific piece of it:
Figure out where you want to split the big file into subfiles. Put comments in some special format at each of those points.
Write a fairly trivial script that will break the file apart into subfiles at those points. (Perhaps the special comments have embedded filenames that the script can use as instructions for how to split it.) It should preserve the comments as part of the splitting.
Run the script. Delete the original file.
When you need to merge from a branch, first recreate the big file by concatenating the pieces back together, do the merge, and then re-split it.
Also, if you want to preserve the SCC file history, I expect the best way to do that is to tell your source control system that the individual piece files are copies of the original. Then it will preserve the history of the sections that were kept in that file, although of course it will also record that large parts were "deleted".
One way to split it without too much danger would be to take a historic look at all the line changes. Are there certain functions that are more stable than others? Hot spots of change if you will.
If a line hasn't been changed in a few years you can probably move it to another file without too much worry. I'd take a look at the source annotated with the last revision that touched a given line and see if there are any functions you could pull out.
Wow, sounds great. I think explaining to your boss, that you need a lot of time to refactor the beast is worth a try. If he doesn't agree, quitting is an option.
Anyway, what I suggest is basically throwing out all the implementation and regrouping it into new modules, let's call those "global services". The "main module" would only forward to those services and ANY new code you write will use them instead of the "main module". This should be feasible in a reasonable amount of time (because it's mostly copy and paste), you don't break existing code and you can do it one maintenance version at a time. And if you still have any time left, you can spend it refactoring all old depending modules to also use the global services.
Do not ever touch this file and the code again!
Treat it like something you are stuck with. Start writing adapters for the functionality encoded there.
Write new code in different units and talk only to adapters which encapsulate the functionality of the monster.
... if only one of the above is not possible, quit the job and get you a new one.
My sympathies - in my previous job I encountered a similar situation with a file that was several times larger than the one you have to deal with. Solution was:
Write code to exhaustively test the functions in the program in question. It sounds like you won't already have this in hand...
Identify some code that can be abstracted out into a helper/utilities class. Need not be big, just something that is not truly part of your 'main' class.
Refactor the code identified in 2. into a separate class.
Rerun your tests to ensure nothing got broken.
When you have time, goto 2. and repeat as required to make the code manageable.
The classes you build in iterations of step 3. will likely grow to absorb more code that is appropriate to their newly clear function.
I could also add:
0: buy Michael Feathers' book on working with legacy code
Unfortunately this type of work is all too common, but my experience is that there is great value in being able to make working but horrid code incrementally less horrid while keeping it working.
Consider ways to rewrite the entire application in a more sensible way. Maybe rewrite a small section of it as a prototype to see if your idea is feasible.
If you've identified a workable solution, refactor the application accordingly.
If all attempts to produce a more rational architecture fail, then at least you know the solution is probably in redefining the program's functionality.
My 0.05 eurocents:
Re-design the whole mess and split it into subsystems, taking into account the technical and business requirements (= many parallel maintenance tracks, potentially with a different codebase for each, an obvious need for high modifiability, etc.).
When splitting into subsystems, analyze the places which have changed the most and separate those from the unchanging parts. This should show you the trouble spots. Separate the most-changing parts into their own modules (e.g. a DLL) in such a way that the module API can be kept intact and you don't need to break backward compatibility all the time. This way you can deploy different versions of the module for different maintenance branches, if needed, while keeping the core unchanged.
The redesign will likely need to be a separate project, trying to do it to a moving target will not work.
As for the source code history, my opinion: forget it for the new code. But keep the history somewhere so you can check it, if needed. I bet you won't need it that much after the beginning.
You most likely need to get management buy-in for this project. You can argue perhaps with faster development time, less bugs, easier maintaining and less overall chaos. Something along the lines of "Proactively enable the future-proofness and maintenance viability of our critical software assets" :)
This is how I'd start to tackle the problem at least.
Start by adding comments to it, with references to where functions are called and whether you can move things around. This can get things moving. You really need to assess how brittle the code base is. Then move common bits of functionality together. Small changes at a time.
Another book you may find interesting/helpful is Refactoring.
Something I find useful to do (and I'm doing it now, although not at the scale you face) is to extract methods as classes (the method-object refactoring). The methods that differ across your different versions become different classes, which can be injected into a common base to provide the different behaviour you need.
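A minimal sketch of the method-object refactoring, with hypothetical names; the long method's locals become fields of a small class, which different product versions can then replace or subclass:
// Hypothetical: the body of a long MainClass::calculateSchedule() extracted into a method object.
struct ScheduleInput  { /* the parameters the long method used to take */ };
struct ScheduleOutput { /* the results it used to produce */ };

class ScheduleCalculation
{
public:
    ScheduleCalculation(const ScheduleInput& input, ScheduleOutput& output)
        : input_(input), output_(output) {}

    void run()
    {
        loadRules();    // former chunks of the long method become small
        applyRules();   // private helpers sharing state via members
    }

private:
    void loadRules()  { /* ... */ }
    void applyRules() { /* ... */ }

    const ScheduleInput& input_;
    ScheduleOutput& output_;
};

// The old method in the main class then shrinks to a delegation:
//   void MainClass::calculateSchedule(const ScheduleInput& in, ScheduleOutput& out)
//   {
//       ScheduleCalculation(in, out).run();
//   }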
I found this sentence to be the most interesting part of your post:
> The file is used and actively changed in several (> 10) maintenance versions of our product and so it is really hard to refactor it
First, I would recommend that you use a source control system that supports branching for developing these 10+ maintenance versions.
Second, I would create ten branches (one for each of your maintenance versions).
I can feel you cringing already! But either your source control isn't working for your situation because of a lack of features, or it's not being used correctly.
Now to the branch you work on - refactor it as you see fit, safe in the knowledge that you'll not upset the other nine branches of your product.
I would be a bit concerned that you have so much in your main() function.
In any project I write, I would use main() only to perform initialization of core objects - like a simulation or application object - and those classes are where the real work should go on.
I would also initialize an application logging object in main for use globally throughout the program.
Finally, in main() I also add leak-detection code in preprocessor blocks that ensure it's only enabled in DEBUG builds. This is all I would add to main(). main() should be short!
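A minimal sketch of that kind of main(), with hypothetical class and function names:
// main.cpp -- hypothetical names; main() only wires up the core objects
#include "application.h"   // assumed core application/simulation class
#include "logger.h"        // assumed application-wide logging facility

int main(int argc, char* argv[])
{
#ifdef _DEBUG
    enableLeakDetection(); // assumed debug-only leak-detection hook
#endif

    Logger::init("app.log");       // logging set up once, globally available

    Application app(argc, argv);
    return app.run();              // all the real work happens inside Application
}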
You say that
> The file basically contains the "main class" (main internal work dispatching and coordination) of our program
It sounds like these two tasks could be split into two separate objects - a co-ordinator and a work dispatcher.
When you split these up, you may mess up your "SCC workflow", but it sounds like adhering stringently to your SCC workflow is causing software maintenance problems. Ditch it now and don't look back, because as soon as you fix it, you'll begin to sleep easy.
If you're not able to make the decision, fight tooth and nail with your manager for it - your application needs to be refactored - and badly by the sounds of it! Don't take no for an answer!
As you've described it, the main issue is diffing pre-split vs post-split, merging in bug fixes, etc. Tool around it. It won't take that long to hack together a script in Perl, Ruby, etc. to rip out most of the noise when diffing pre-split against a concatenation of post-split. Do whatever's easiest in terms of handling noise:
remove certain lines pre/during concatenation (e.g. include guards)
remove other stuff from the diff output if necessary
You could even make it so whenever there's a checkin, the concatenation runs and you've got something prepared to diff against the single-file versions.
"The file basically contains the "main class" (main internal work dispatching and coordination) of our program, so every time a feature is added, it also affects this file and every time it grows."
If that big SWITCH (which I think there is) becomes the main maintenance problem, you could refactor it to use a dictionary and the Command pattern, removing all switch logic from the existing code into a loader which populates the map, i.e.:
// declaration
std::map<ID, ICommand*> dispatchTable;
...
// populating using some loader
dispatchTable[id] = concreteCommand;
...
// using
dispatchTable[id]->Execute();
I think the easiest way to track the history of source when splitting a file would be something like this:
Make copies of the original source code, using whatever history-preserving copy commands your SCM system provides. You'll probably need to submit at this point, but there's no need yet to tell your build system about the new files, so that should be ok.
Delete code from these copies. That should not break the history for the lines you keep.
I think what I would do in this situation is bite the bullet and:
Figure out how I wanted to split the file up (based on the current development version)
Put an administrative lock on the file ("Nobody touch mainmodule.cpp after 5pm Friday!!!").
Spend your long weekend applying that change to the >10 maintenance versions (from oldest to newest), up to and including the current version.
Delete mainmodule.cpp from all supported versions of the software. It's a new Age - there is no more mainmodule.cpp.
Convince management that you shouldn't be supporting more than one maintenance version of the software (at least not without a big $$$ support contract). If each of your customers has their own unique version... yeeeeeshhhh. I'd be adding compiler directives rather than trying to maintain 10+ forks.
Tracking old changes to the file is simply solved by your first check-in comment saying something like "split from mainmodule.cpp". If you need to go back to something recent, most people will remember the change; if it's 2 years from now, the comment will tell them where to look. Of course, how valuable will it be to go back more than 2 years to look at who changed the code and why?

Where do you put your unit test?

I have found several conventions to housekeeping unit tests in a project and
I'm not sure which approach would be suitable for our next PHP project. I am
trying to find the best convention to encourage easy development and
accessibility of the tests when reviewing the source code. I would be very
interested in your experience/opinion regarding each:
One folder for production code, another for unit tests: This separates
unit tests from the logic files of the project. This separation of
concerns is as much a nuisance as it is an advantage: Someone looking into
the source code of the project will - so I suppose - either browse the
implementation or the unit tests (or more commonly: the implementation
only). The advantage of unit tests being another viewpoint to your classes
is lost - those two viewpoints are just too far apart IMO.
Annotated test methods: Any modern unit testing framework I know allows
developers to create dedicated test methods, annotating them (@test) and
embedding them in the project code. The big drawback I see here is that
the project files get cluttered. Even if these methods are separated
using a comment header (like UNIT TESTS below this line) it just bloats
the class unnecessarily.
Test files within the same folders as the implementation files: Our file
naming convention dictates that PHP files containing classes (one class
per file) should end with .class.php. I could imagine that putting unit
tests regarding a class file into another one ending on .test.php would
render the tests much more present to other developers without tainting
the class. Although it bloats the project folders, instead of the
implementation files, this is my favorite so far, but I have my doubts: I
would think others have come up with this already, and discarded this
option for some reason (i.e. I have not seen a java project with the files
Foo.java and FooTest.java within the same folder.) Maybe it's because
java developers make heavier use of IDEs that allow them easier access to
the tests, whereas in PHP no big editors have emerged (like eclipse for
java) - many devs I know use vim/emacs or similar editors with little
support for PHP development per se.
What is your experience with any of these unit test placements? Do you have
another convention I haven't listed here? Or am I just overrating unit test
accessibility to reviewers?
I favour keeping unit-tests in separate source files in the same directory as production code (#3).
Unit tests are not second-class citizens; their code must be maintained and refactored just like production code. If you keep your unit tests in a separate directory, the next developer to change your production code may miss that there are unit tests for it and fail to maintain the tests.
In C++, I tend to have three files per class:
MyClass.h
MyClass.cpp
t_MyClass.cpp
If you're using Vim, then my toggle_unit_tests plug-in for toggling between source and unit test files may prove useful.
The current best practice is to separate the unit tests into their own directory, #1. All of the "convention over configuration" systems do it like this, e.g. Maven, Rails, etc.
I think your alternatives are interesting and valid, and the tool support is certainly there to support them. But it's just not that popular (as far as I know). Some people object to having tests interspersed with production code. But it makes sense to me that if you always write unit tests, that they be located right with your code. It just seems simpler.
I always go for #1, even though it's nice when they are close together. My reasons are as follows:
I feel there's a difference between the core codebase and the unit tests. I need a real separation.
End users rarely need to look at unit tests. They are just interested in the API. While it's nice that unit tests provide a separate view on the code, in practice I feel they won't be used to understand it better (more descriptive documentation + examples do that).
Because end users rarely need unit tests, I don't want to confuse them with more files and/or methods.
My coding standards are not half as strict for unit tests as they are for the core library. This is maybe just my opinion, but I don't care as much about coding standards in my tests.
Hope this helps.

C++ Unit Testing Legacy Code: How to handle #include?

I've just started writing unit tests for a legacy code module with large physical dependencies using the #include directive. I've been dealing with them a few ways that felt overly tedious (providing empty headers to break long #include dependency lists, and using #define to prevent classes from being compiled) and was looking for some better strategies for handling these problems.
I've been frequently running into the problem of duplicating almost every header file with a blank version in order to separate out the class I'm testing in its entirety, and then writing substantial stub/mock/fake code for objects that will need to be replaced since they're now undefined.
Anyone know some better practices?
The depression in the responses is overwhelming... But don't fear, we've got the holy book to exorcise the demons of legacy C++ code. Seriously just buy the book if you are in line for more than a week of jousting with legacy C++ code.
Turn to page 127: the case of the horrible include dependencies. (Now, I am not even within miles of Michael Feathers, but here is an as-short-as-I-could-manage answer...)
Problem: in C++, if ClassA needs to know about ClassB, ClassB's declaration is straight-lifted / textually included in ClassA's source file. And since we programmers love to take things to the wrong extreme, a file can recursively include a zillion others transitively. Builds take years... but hey, at least it builds... we can wait.
Now, to say 'instantiating ClassA under a test harness is difficult' is an understatement. (Quoting MF's example - Scheduler is our poster problem child, with deps galore.)
#include "TestHarness.h"
#include "Scheduler.h"
TEST(create, Scheduler) // your fave C++ test framework macro
{
Scheduler scheduler("fred");
}
This will bring out the includes dragon with a flurry of build errors.
Blow#1 Patience-n-Persistence: Take on each include one at a time and decide if we really need that dependency. Let's assume SchedulerDisplay is one of them, whose displayEntry method is called in Scheduler's ctor.
Blow#2 Fake-it-till-you-make-it (Thanks RonJ):
#include "TestHarness.h"
#include "Scheduler.h"
void SchedulerDisplay::displayEntry(const string& entryDescription) {}
TEST(create, Scheduler)
{
Scheduler scheduler("fred");
}
And pop goes the dependency and all its transitive includes.
You can also reuse the fake methods by encapsulating them in a Fakes.h file to be included in your test files.
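A minimal sketch of such a Fakes.h, following the SchedulerDisplay example above (the second fake is a hypothetical extra dependency, added for illustration); include it from one translation unit of each test program:
// Fakes.h -- collected fake definitions for the scheduler test program
#ifndef FAKES_H
#define FAKES_H

#include <string>
#include "SchedulerDisplay.h"   // assumed header declaring the real dependency
#include "AuditLog.h"           // hypothetical second dependency

// Empty/no-op definitions satisfy the linker without dragging in the real code.
void SchedulerDisplay::displayEntry(const std::string& entryDescription) {}
void AuditLog::record(const std::string& message) {}

#endif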
Blow#3 Practice: It may not always be that simple... but you get the idea. After the first few duels, the process of breaking deps will get easy and mechanical.
Caveats (Did I mention there are caveats? :)
We need a separate build for the test cases in this file; we can have only one definition of the SchedulerDisplay::displayEntry method in a program. So create a separate program for the scheduler tests.
We aren't breaking any dependencies in the program, so we are not making the code cleaner.
You need to maintain those fakes for as long as you need the tests.
Your sense of aesthetics may be offended for a while... just bite your lip and 'bear with us for a better tomorrow'.
Use this technique for a very huge class with severe dependency issues. Don't use it often or lightly... Use it as a starting point for deeper refactorings. Over time this testing program can be taken behind the barn as you extract more classes (WITH their own tests).
For more.. please do read the book. Invaluable. Fight on bro!
Since you're testing legacy code, I'm assuming you can't refactor said code to have fewer dependencies (e.g. by using the pimpl idiom).
That leaves you with few options, I'm afraid. Every header that was included for a type or function will need a mock object for that type or function for everything to compile; there's little you can do...
I am not answering your question directly but I am afraid that unit testing just may not be the thing to do if you work with large amounts of legacy code.
After leading an XP team on a green field development project I really loved my Unit tests. Things happened and a few years later I find myself working on a large legacy code base that has lots of quality problems.
I tried to find a way to add unit tests to the application, but in the end just got stuck in a catch-22:
In order to write meaningful unit tests, the code would need to be refactored.
Without unit tests it will be too dangerous to refactor the code.
If you feel like a hero and drink the Kool-Aid on unit testing, then you may still give it a try, but there is a real risk that you end up with just more test code of little value that now also needs to be maintained.
Sometimes it is just best to work on the code in the way that is "designed" to be worked on.
I don't know if this will work for your project but
you might try to attack the problem from the link phase of your build.
This would completely eliminate your #include problem.
All you would need to do is re-implement the interfaces in the included files to do whatever you want, and then just link to the mock object files that you have created to implement the interfaces in the include files.
The big disadvantage of this method is a more complicated build system.
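A minimal sketch of the idea, with hypothetical names: the test build compiles against the real header but links a mock implementation file instead of the production one.
// database.h -- the real production header, unchanged (hypothetical)
#ifndef DATABASE_H
#define DATABASE_H
#include <string>

class Database
{
public:
    bool save(const std::string& record);
};

#endif

// mock_database.cpp -- linked into the test program INSTEAD of database.cpp
#include "database.h"

bool Database::save(const std::string&) { return true; }   // canned behaviour for the tests

// Production builds link database.cpp; the test build links mock_database.cpp,
// so no #include in the code under test has to change.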
If you keep writing stubs/mock/fake code, you risk doing unit testing on a class that has different behavior than when it is compiled into the main project.
But if those includes are there and have no added behavior, then it's OK.
I'd try not to change anything in the includes while doing the unit testing, so you're sure (as far as you can be with legacy code :) ) that you're testing the real code.
You're definitely between a rock and a hard place with legacy code with large dependencies. You've got a long hard slog ahead to sort it all out.
From what you say, it seems you are trying to keep the source code intact for each module in turn, placing it in a test harness with external dependencies mocked out. My suggestion here would be to take the even braver step of attempting some refactoring to eliminate (or invert) the dependencies, which is probably the very step you are trying to avoid.
I suggest this because I'm guessing the dependencies are going to kill you as you write tests. You will certainly be better off in the long term if you can eliminate the dependencies.