What's with clojure's long files?

I've been learning clojure for some weeks and recently I began reading some open source code: clojure and clojurescript compilers and some libraries like om, boot, figwheel.
I noticed some clojure files are very long, some of them more than a thousand LOC. Given that clojure's code is very terse and low ceremony, a file that size holds much more logic than a file of the same length in some other languages.
Coming from an OO background where you usually have one class per file and you try to keep your classes short (SRP) I found that a little weird.
I know that clojure code is mostly composed of pure functions and they're way easier to reason about than some mutable class where you need to keep the current state in mind, and I find that I can read and understand most of the functions one at a time. But most of those functions are very well designed so that they don't depend on each other: even though you can use (filter odd?) it doesn't mean that filter and odd? are related. For "every day" code (LOB apps, web apps, etc) it's very hard to keep the functions as self-contained as those (at least that's my experience with OO programming).
I've also seen some demos of clojurescript applications (om, reagent, etc) where they declare all components in the same file. I don't know if that's because it's just a demo and in a real life application you'd have a product.clj and a category.clj or that's just the clojure way: to have one file per namespace/module/bounded-context.
I think that if I open a folder and see product.clj, category.clj, order.clj, etc, I can tell at a glance what that folder is about, better than if it just had a components.clj or core.clj.
So, my questions are:
1. Is it common for "every day" clojure code to have these very long files? Or is it just because I'm reading library code, and "normal" code is more "modular", meaning more files of shorter length?
2. Does having long files like those actually make it harder to comprehend at a glance what the application is about, as in my product/category/order example above? Or is that not an issue thanks to some clojuresque property?
3. In case long files are the "clojure way", how do you handle conflicts, refactorings, programming in a team... if everybody is touching the same file?

1: I looked at the reasonably large non-library clojure project I'm working on right now and ran this:
ls **/*.clj | xargs wc -l | awk '{print $1}' | head -n -1 > counts
and opened a REPL and, with those numbers loaded into counts, ran
user> (float (/ (reduce + counts) (count counts)))
208.76471
I see that on a project with 17k LOC our average clojure file has 200 lines in it. I found one with 1k LOC.
2: Yes, I'll get started breaking that long one down as soon as I have free time. Some very long ones, such as clojure.core, are long because of clojure's one-pass design and the need to self-bootstrap: they need to build the ability to have many namespaces, for instance, before they can use it. For other fancy libraries it may very well be that they had some other design reason for a large file, though usually it's a case of "pull request welcome" in my experience.
3: I do work in a large team with a few large files, we handle merge conflicts with git, though because changes tend to be within a function these come up, for me, much less often than in other languages. I find that it's simply not a problem.

They tend to get long as you develop them. Say you need a function foo to do procedures [a b ...] on data structure K. You first (def) the signature of the function and then implement the helper functions a, b, ...; since they're likely all pure functions and the functionality you need from foo is complex, the namespace tends to get long.
Sometimes, but the REPL is a really useful tool: to understand a new library's main functionality I often use clojure.repl/source on the function and work my way backwards through its helper functions. I find that a lot of the time Clojure libraries' documentation is either cryptic or nonexistent, but as many in the community like to say, Clojure functions' source is self-documenting.
I have no experience working in a large team, but Arthur Ulfeldt is right: most changes happen in a single function. I gather this from reading the diffs of pull requests with GitHub's Blame feature.

It is pragmatic (clojure or not) to avoid dependencies. Naming and classifying abstract things makes our intellect feel good, but it sort of gives up when having to stitch back all the parts together. Why make three files when one will do?
Making your mind up on what an app / lib is all about, just by reading the code? There's the "what" and there's the "how". Better have a clue about the former if you want to dive into the latter. If you're reading the code to get a clue about the purpose of an app, I'm not sure having it split amongst more files will make it easier. Think twice about your example: none of these things can exist without the others.
The difficulty with large teams is sharing up-to-date knowledge, not files or lines, thanks git. Maybe having everyone on the same single file would be a damn good thing after all?
No, large files are not a problem in clj or other tongues. Unit<->file is a totally javartificial concept that helps compilers, not men. Split the fg buffer.

In addition to the answers others have given, here are two more.
1. It may be that some files are long because in Clojure it's most straightforward to use one file per namespace, so that if you want all of those definitions in the same namespace, it's easier to put them in one file. One reason to want definitions to reside in the same namespace is given in #2.
2. The Clojure compiler won't allow certain kinds of cyclic dependencies between namespaces (other cyclic dependencies between namespaces are fine). One way to avoid an illegal cyclic dependency is to put the interdependent definitions into the same namespace. If you do that, it might make sense to pull other definitions that belong with the problematic ones into the single namespace as well. See #1 for the rest of this answer.
(My own taste is for several smaller files, although not as small as many Java class files. Also: Code is usually not as self-documenting as its author thinks. This may hold even when the author and the person reading the code later are the same person.)

Related

OOP - Do I over complicate things?

I was looking at some of my projects and comparing them to things I've seen on github and I feel like I over-think things. I like OOP but I feel like I make too many files, too many classes.
For example, on a small project I had of a game of checkers, I had so many files that could maybe all go into one file/class. How do I know when I have over-thought my solutions? Here is what some of my files look like;
|src
| |- player.cpp
| |- piece.cpp
| |- color.cpp
| |- ...
And of course, there are many more files that will deal with things like rules, setting up the game, GUI, etc. But in this short example you can see how my projects can and will get very large. Is it common to write things this way? Or should I simply write a player.cpp file that contains multiple related classes, which in this case would set pieces/colors/king information, etc.?

Yes, distributing your code to multiple files is a good practice, since it makes your project maintainable.
I can see your concerns on a small project (is the overhead worth it?), but in really big projects, if you don't do it that way, you will end up with people scrolling forever in a large file and searching through the file to find what they are looking for.
Try to keep your files compact, and one class per file, where every class is robust and its goal is clear.
Sometimes we put free functions in files. It would not be wise to have a file for every small, inline function, since that increases the number of files for no reason. It would be better to have a family of functions inside a file (functions related to printing, for example).
At the end, it's probably opinion based which is the ideal balance between size and number of files, but I hope I made myself clear.

You are actually asking two distinct questions: "what is the good granularity for separating functionality into classes" and "what is the good practice to organize project file structure". Both are rather broad.
The answer to the first one would probably be to follow the single responsibility principle. The answer to the second one would be to make the folder structure resemble the namespace structure (as in boost, for example). Your current approach of storing everything in the src folder is not good for C++, because it leads to longer file names to prevent name collisions when classes with the same name appear in different namespaces. Larger projects do indeed tend to have too many files, as one class would require 4-5 files. And that leads to yet another question of selecting appropriate granularity for projects...

People tend to worry a lot about "too many classes" or "too many files", but that's largely a historical carryover. 40 years ago when we wrote programs on punch cards, and had to carry large trays and boxes of them (and not drop them!), this certainly would have been a concern. 35 years ago when the biggest hard drive you could get for a PC was 33MB, this was a concern. Today, when you wouldn't consider buying a PC with less than 512GB of SSD, and have access to terabytes and petabytes of online storage, the number of files and number of bytes taken up by the programs are essentially immaterial to the development process.
What that means to us humans is that we should use this abundance of capacity to improve other aspects of our code. If more files helps you understand the code better, use more files. If longer file names help you understand the code better, use longer file names. If following a rule like "one .cpp and one .h file per class" helps people maintain the codebase, then follow the rule. The key is to focus on truly important issues, such as "what makes this code more maintainable, more readable, more understandable to me and my team?"
Another way to approach this is to ask if the "number of files" is a useful metric for determining if code is maintainable? While a number that is obviously too low for the app would be concerning, I wouldn't be able to tell you if 10 or 100 or 1000 was an appropriate number (at least without knowing the number of classes they contain.) Therefore, it doesn't appear to be a useful metric.
Does that mean a program should have 1000 files all piled into a single folder, all compiling and linking into a single library or executable file? It depends, but it seems that 1000 classes in the same namespace would be a bit crowded and the resultant program might be too monolithic. At some point you'll probably want to refactor the architecture into smaller, more cohesive packages, each with an appropriate area of responsibility. Of course, nobody can tell you what that magic number is, as it's completely application dependent. But it's not the number of files that drives a decision like this, it's that the files should be related to each other logically or architecturally.

Each class should be designed and programmed to accomplish one, and only one, thing
Because each class is designed to have only a single responsibility, many classes are used to build an entire application

How do I effectively manage a Clojure code base?

A coworker and I are Clojure newbies. We started a project a couple months back, but quickly found that we had a tough time dealing with our code base -- by 500 LOC we basically had no idea where to start with the debugging, when things went wrong (which was often). Instead of pairs, functions were getting lists, or numbers, or what-have-you.
Now we're starting a new but related project and migrating a lot of the old code over. But we're again hitting a wall.
We're wondering, how do we effectively manage a Clojure project, especially as we make changes to existing code?
What we've come up with:
liberal use of unit-tests
liberal use of pre-, post-conditions
informal type declarations in function comments
use defrecord/defstruct/defprotocol to implement a data model, which would really simplify testing
But post-, pre-conditions seem not to be used very often. Unit-testing + comments will only help so much. And it seems like Clojure programmers don't typically implement formal data models.
Do we just not get Clojure? How do Clojure programmers know that their code is robust and correct?

I think this is actually an evolving area - Clojure hasn't really been around long enough for all of the best practices and associated tools for managing a large code base to be developed yet.
Some suggestions from my experience:
Structure your code in a "bottom up" way - in general, the way you want to structure your code will have the "utility" code at the top of the file (or imported from another namespace) and the "business logic" code that uses these utility functions towards the end of the file. If this seems difficult to do, then it's probably a hint that your code needs some refactoring.
Tests as examples - Test code in clojure works very well both to sanity check your code but also as documentation (e.g. "what kind of parameter is this function expecting?"). If you hit a bug, refer to your tests to check your assumptions and write a couple of new tests to flush out what is going wrong.
Keep functions simple and compose them - Kind of an extension of the "single responsibility principle" to functional programming. I consider more than 5-10 lines in a Clojure function as a major code smell (if this seems extreme, just remember that you can probably achieve as much in 5-10 lines of Clojure as you could with 50-100 lines of Java/C#)
Watch out for "imperative habits" - when I first started using Clojure, I wrote a lot of pseudo-imperative code in Clojure. An example would be emulating a for loop with "dotimes" and accumulating some result within an atom. This can be painful - it's not idiomatic, it's confusing and usually there is a much smarter, simpler and less error-prone functional way of doing it. This takes practice, but it is worth it in the long run...
Debug at the REPL - usually when I hit an issue, coding at the REPL is the easiest way to flush it out. Generally this means running some specific parts of the larger algorithm to check assumptions etc.
Refactor common utility functions out - you'll probably find a bunch of common code or structure repeated in many functions. It's well worth pulling this out into a function or macro that you can re-use in other places or projects - that way you can test it much more rigorously and have the benefits in multiple places. Bonus points if you can get it all the way upstream into Clojure itself! If you do this well enough, then your main code base will be extremely succinct and therefore easy to manage, containing nothing but the genuinely domain-specific code.

simple composable abstractions
"It is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures." - Alan J. Perlis
For me it's all about composing simple functions. Try to break every function down into the smallest units you can and then have another function that composes them to do the work you need. You know you are in good shape if every function can be tested independently. If you go too heavy on the macros then it can make this step harder, because macros compose differently.
D.R.Y, Seriously, just don't repeat yourself
Starting with well decomposed functions in a bunch of namespaces, every time I need one of the composable parts somewhere else I "hoist" that function up to a library included by both namespaces. This way your commonly used abstractions sort of evolve over the course of the project into "just enough framework". It is very difficult to do this unless you really have discrete composable abstractions.

Sorry to dig up this old question, the answers by mikera and Arthur are excellent, but it's something I've also wondered about as I've been learning Clojure, and thought I'd mention how we organise files.
In a similar vein to ensuring each function has a single job, we group related functions into namespaces to make it easier to navigate the code. So we might have a namespace for functions providing access to a particular database, or providing a collection of HTTP-related utilities. This keeps each file relatively small, and makes tests easier to find. It also makes refactoring much more straightforward. This is hardly anything new, but it's worth bearing in mind.

Organizing large C++ project

Should all C++ code in a project be encapsulated into a single class with main simply calling that class? Or should the main function declare variables and classes.

If you are going to build a large project in C++, you should at the very least read Large Scale C++ Software Design by John Lakos. It's a little old but it sounds like you could benefit from the fundamentals in it.
Keep in mind that building a large scale system in any language is a challenge and requires skill and discipline to prevent it falling to pieces very quickly. Don't take it lightly.
That said, if your definition of "large" is different than mine then I may have alternative advice to give you. I'm assuming you're talking about a project where the word "million" will be mentioned in sentences that also contain the words "lines of code".

for large C++ projects, you should create many classes!
main should just kick things off (maybe doing a few housekeeping things) and then call into a class that will fire up the rest of the system
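A minimal sketch of that shape, assuming a hypothetical Application class and Config helper (neither name comes from the question). The entry point is called start here only so the sketch stays a library-style snippet:
```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical housekeeping helper: wraps the command-line arguments.
class Config {
public:
    explicit Config(std::vector<std::string> args) : args_(std::move(args)) {}
    bool verbose() const {
        for (const auto& arg : args_) {
            if (arg == "--verbose") return true;
        }
        return false;
    }
private:
    std::vector<std::string> args_;
};

// Hypothetical top-level class that fires up the rest of the system.
class Application {
public:
    explicit Application(Config config) : config_(std::move(config)) {}
    // Returns the process exit code; 0 means success.
    int run() {
        log_.push_back("started");
        if (config_.verbose()) log_.push_back("verbose mode");
        log_.push_back("stopped");
        return 0;
    }
    const std::vector<std::string>& log() const { return log_; }
private:
    Config config_;
    std::vector<std::string> log_;
};

// Everything main would do: collect the arguments, then hand over control.
int start(int argc, const char* const argv[]) {
    std::vector<std::string> args;
    for (int i = 1; i < argc; ++i) args.push_back(argv[i]);
    Application app{Config{args}};
    return app.run();
}
```
With this in place, main itself shrinks to a one-liner that forwards argc and argv to start.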

If it's a class that really makes sense, sure -- but at least IME, that's a fairly rare exception, not the general rule.
Here, I'm presuming that you don't really mean all the code is in one class, but that there's a single top-level class, so essentially all main does is instantiate and use it. That class, in turn, will presumably instantiate and use other subordinate classes.
If you really mean "should all the code be contained in a single class?", then the answer is almost certainly a resounding "no", except for truly minuscule projects. Much of the design of classes in C++ (and most other OO languages) is completely pointless if all the code is in one class.

If you can put your entire project in one class without going insane, your definition of "large" may be different than most people's here. Which is fine -- just keep in mind when you ask people about a "large" c++ project, they will assume you're talking about something that takes multiple person-years to create.
That said, the same principles of encapsulation apply no matter what the size of the project. Break your logic and data into units that make sense and are not too tied together and then organize your class(es) around those divisions. Don't be afraid to try one organization and then refactor it into another organization if you find yourself copy-pasting code, or if you find one class depending too heavily on another. (Or if you find yourself with too many classes and you're creating many objects to accomplish one task where a single object would be cleaner and easier on you.)
Have fun and don't be afraid to experiment a little.

In C++ you should avoid putting an entire project in one class, irrespective of whether it is big or small. At most, you could put it in one or two namespaces (which can be split across files).
The advantages of having multiple classes are:
Better maintainability of your code
Putting classes in multiple .h and .cpp files (i.e. small modules) helps you debug faster
If all code is in one class and changes are made somewhere, then one has to recompile the whole project. If instead the project is split across modules, one can compile just the module where changes were made. That saves a lot of time.

No! Each header/implementation file pair should represent a single class. Placing a large project in one file is a surefire way to disaster: the project becomes unmaintainable and compiling will take ages. Break up your code into appropriately sized pieces.
The main function should not declare the classes; rather, the file that contains it (often named something like main.cpp, driver.cpp, projectname.cpp) should use #include directives to make the compiler read the declarations in header files. Read up on C++'s separate compilation model for more info.
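As a rough illustration, here is how one header/implementation pair and its user might be laid out. The three "files" are shown back to back in one listing so it compiles as a single unit; the Player class and play_round function are made up for the example.
```cpp
// ---- player.h (declaration only; would sit behind an include guard) ----
#include <string>

class Player {
public:
    explicit Player(std::string name);
    const std::string& name() const;
    void add_points(int points);
    int score() const;
private:
    std::string name_;
    int score_ = 0;
};

// ---- player.cpp (definitions; would start with #include "player.h") ----
#include <utility>

Player::Player(std::string name) : name_(std::move(name)) {}
const std::string& Player::name() const { return name_; }
void Player::add_points(int points) { score_ += points; }
int Player::score() const { return score_; }

// ---- main.cpp (would also #include "player.h") ----
// Code here only needs the declarations; the linker later joins the calls
// to the definitions compiled from player.cpp.
int play_round(Player& player) {
    player.add_points(10);
    return player.score();
}
```
Split into real files, a change inside player.cpp would only require recompiling that one file and relinking.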
Some newcomers to C++ find the compilation model - as well as error codes generated when you screw it up - incomprehensible or intimidating and give up thinking it's not worth it. Don't let this be you. Learn how to properly organize your code.

Common problems with Clojure multi-methods and protocols?

I am asking this question as I am starting to really use multimethods and protocols a lot, but in doing so I'm also wondering if I'm making my code too unmaintainable. For example, in the good old (or bad old :) OO days I would know where to find everything related to a particular type, which would mean that all interfaces and methods would be in the same source file, but now they can be spread out all over the place. Any experiences on this?

It is true that everything can be scattered in different places if you are not forced to organize code in certain ways, like Java forces you.
It's completely up to you as a developer to organize code into logical units so that things are easier to find. Keep in mind that with great power comes great responsibility.
The more you work in a functional style, the better the ways you'll find to organize your code; the key is not to be afraid of refactoring. Besides, M-. in Emacs/Slime will bring you to the definition of a symbol wherever you are. I suppose other Clojure IDE plugins have a similar feature.

What to do about a 11000 lines C++ source file?

So we have this huge (is 11000 lines huge?) mainmodule.cpp source file in our project and every time I have to touch it I cringe.
As this file is so central and large, it keeps accumulating more and more code and I can't think of a good way to make it actually start to shrink.
The file is used and actively changed in several (> 10) maintenance versions of our product and so it is really hard to refactor it. If I were to "simply" split it up, say for a start, into 3 files, then merging back changes from maintenance versions will become a nightmare. And also if you split up a file with such a long and rich history, tracking and checking old changes in the SCC history suddenly becomes a lot harder.
The file basically contains the "main class" (main internal work dispatching and coordination) of our program, so every time a feature is added, it also affects this file and every time it grows. :-(
What would you do in this situation? Any ideas on how to move new features to a separate source file without messing up the SCC workflow?
(Note on the tools: We use C++ with Visual Studio; We use AccuRev as SCC but I think the type of SCC doesn't really matter here; We use Araxis Merge to do actual comparison and merging of files)

Merging now will not be as big a nightmare as it will be once you have a 30000 LOC file. So:
Stop adding more code to that file.
Split it.
If you can't just stop coding during the refactoring process, you could leave this big file as is for a while, at least without adding more code to it: since it contains one "main class", you could inherit from it and keep the derived class(es) with overridden functions in several new, small and well designed files.
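A small sketch of the inheritance idea, assuming the big class exposes (or can be given) a virtual function to hook into. MainModule, handle and the "export" feature are placeholders, not the asker's real API.
```cpp
#include <string>

// Stand-in for the class that lives in the huge file. It stays untouched.
class MainModule {
public:
    virtual ~MainModule() {}
    virtual std::string handle(const std::string& request) {
        return "legacy:" + request;
    }
};

// The new feature lives in its own small file, in a derived class.
class MainModuleWithExport : public MainModule {
public:
    std::string handle(const std::string& request) override {
        if (request == "export") return "export:done";
        return MainModule::handle(request);  // everything else is unchanged
    }
};
```
Callers keep using a MainModule reference; only the place that constructs the object has to know about the derived class.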

Find some code in the file which is relatively stable (not changing fast, and doesn't vary much between branches) and could stand as an independent unit. Move this into its own file, and for that matter into its own class, in all branches. Because it's stable, this won't cause (many) "awkward" merges that have to be applied to a different file from the one they were originally made on, when you merge the change from one branch to another. Repeat.
Find some code in the file which basically only applies to a small number of branches, and could stand alone. Doesn't matter whether it's changing fast or not, because of the small number of branches. Move this into its own classes and files. Repeat.
So, we've got rid of the code that's the same everywhere, and the code that's specific to certain branches.
This leaves you with a nucleus of badly-managed code - it's needed everywhere, but it's different in every branch (and/or it changes constantly so that some branches are running behind others), and yet it's in a single file that you're unsuccessfully trying to merge between branches. Stop doing that. Branch the file permanently, perhaps by renaming it in each branch. It's not "main" any more, it's "main for configuration X". OK, so you lose the ability to apply the same change to multiple branches by merging, but this is in any case the core of code where merging doesn't work very well. If you're having to manually manage the merges anyway to deal with conflicts, then it's no loss to manually apply them independently on each branch.
I think you're wrong to say that the kind of SCC doesn't matter, because for example git's merging abilities are probably better than the merge tool you're using. So the core problem, "merging is difficult" occurs at different times for different SCCs. However, you're unlikely to be able to change SCCs, so the issue is probably irrelevant.

It sounds to me like you're facing a number of code smells here. First of all the main class appears to violate the open/closed principle. It also sounds like it is handling too many responsibilities. Due to this I would assume the code to be more brittle than it needs to be.
While I can understand your concerns regarding traceability following a refactoring, I would expect that this class is rather hard to maintain and enhance and that any changes you do make are likely to cause side effects. I would assume that the cost of these outweighs the cost of refactoring the class.
In any case, since the code smells will only get worse with time, at least at some point the cost of these will outweigh the cost of refactoring. From your description I would assume that you're past the tipping point.
Refactoring this should be done in small steps. If possible add automated tests to verify current behavior before refactoring anything. Then pick out small areas of isolated functionality and extract these as types in order to delegate the responsibility.
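One way to "verify current behavior" is to record what the code does today and check that the record still holds after each small step. A sketch, where legacy_status_label stands in for some logic buried in the big class and the recorded values are whatever the code currently returns:
```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for existing logic; its quirks are part of the behavior to pin.
std::string legacy_status_label(int code) {
    if (code < 0) return "ERR";
    if (code == 0) return "idle";
    if (code > 99) return "busy";  // quirk: 100 and above collapse to "busy"
    return "job-" + std::to_string(code);
}

// One recorded observation of the current behavior.
struct Recorded {
    int input;
    std::string output;
};

// Returns the inputs whose output no longer matches what was recorded.
// An empty result means the pinned behavior has been preserved.
std::vector<int> changed_inputs(const std::vector<Recorded>& recorded) {
    std::vector<int> changed;
    for (std::size_t i = 0; i < recorded.size(); ++i) {
        if (legacy_status_label(recorded[i].input) != recorded[i].output) {
            changed.push_back(recorded[i].input);
        }
    }
    return changed;
}
```
The point is not that the recorded outputs are right, only that a refactoring step must not change them by accident.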
In any case, it sounds like a major project, so good luck :)

The only solution I have ever imagined to such problems follows. The real gain of the method described is that the changes are progressive: evolution, not revolution, otherwise you'll be in trouble very fast.
Insert a new cpp class above the original main class. For now, it would basically redirect all calls to the current main class, but aim at making the API of this new class as clear and succinct as possible.
Once this has been done, you get the possibility to add new functionalities in new classes.
As for existing functionalities, you have to progressively move them in new classes as they become stable enough. You will lose SCC help for this piece of code, but there is not much that can be done about that. Just pick the right timing.
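The steps above could look roughly like this. LegacyMainModule, its numeric dispatch codes, Coordinator and WordCounter are all invented for the sketch; the real main class will look different.
```cpp
#include <sstream>
#include <string>

// Stand-in for the original main class with an opaque, code-driven API.
class LegacyMainModule {
public:
    int dispatch(int code, int arg) {
        switch (code) {
            case 1: open_ += 1; return arg;  // open a document
            case 2: open_ -= 1; return arg;  // close a document
            case 3: return open_;            // query the open count
            default: return -1;
        }
    }
private:
    int open_ = 0;
};

// New functionality goes into new classes, not into the legacy file.
class WordCounter {
public:
    int count(const std::string& text) const {
        std::istringstream in(text);
        std::string word;
        int n = 0;
        while (in >> word) ++n;
        return n;
    }
};

// The new class placed above the original one. For now it mostly redirects
// calls, but it gives them clear names and hides the dispatch codes.
class Coordinator {
public:
    explicit Coordinator(LegacyMainModule& legacy) : legacy_(legacy) {}
    void open_document(int id) { legacy_.dispatch(1, id); }
    void close_document(int id) { legacy_.dispatch(2, id); }
    int open_count() { return legacy_.dispatch(3, 0); }
    int word_count(const std::string& text) const {
        return WordCounter().count(text);
    }
private:
    LegacyMainModule& legacy_;
};
```
Existing behavior can later move out of LegacyMainModule piece by piece without callers of Coordinator noticing.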
I know this is not perfect, though I hope it can help, and the process must be adapted to your needs!
Additional information
Note that Git is an SCC that can follow pieces of code from one file to another. I have heard good things about it, so it could help while you are progressively moving your work.
Git stores file contents as blobs and, if I understand correctly, does not record moves as such; it detects moved code after the fact by comparing content (git blame -C, for instance, looks for lines moved or copied from other files). Apart from the video from Linus Torvalds mentioned in comments below, I have not been able to find something clear about this.

Confucius say: "first step to getting out of hole is to stop digging hole."

Let me guess: Ten clients with divergent feature sets and a sales manager that promotes "customization"? I've worked on products like that before. We had essentially the same problem.
You recognize that having an enormous file is trouble, but even more trouble is ten versions that you have to keep "current". That's multiple maintenance. SCC can make that easier, but it can't make it right.
Before you try to break the file into parts, you need to bring the ten branches back in sync with each other so that you can see and shape all the code at once. You can do this one branch at a time, testing both branches against the same main code file. To enforce the custom behavior, you can use #ifdef and friends, but it's better as much as possible to use ordinary if/else against defined constants. This way, your compiler will verify all types and most probably eliminate "dead" object code anyway. (You may want to turn off the warning about dead code, though.)
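To make the #ifdef versus if/else point concrete, here is a small sketch. CLIENT_ACME, kClientAcme and invoice_footer are made-up names; in a real build the macro would presumably be set by the build system.
```cpp
#include <string>

// Hypothetical per-client switch, defaulting to "off" when not provided
// by the build (e.g. via a -DCLIENT_ACME=1 compiler flag).
#ifndef CLIENT_ACME
#define CLIENT_ACME 0
#endif

const bool kClientAcme = (CLIENT_ACME != 0);

std::string invoice_footer() {
    // Unlike an #ifdef block, both branches are always parsed and
    // type-checked; the optimizer is free to drop the branch never taken.
    if (kClientAcme) {
        return "ACME Corp. terms apply";
    } else {
        return "Standard terms apply";
    }
}
```
A typo in the ACME branch now breaks every build, not just the ACME one, which is exactly the verification the answer is after.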
Once there's only one version of that file shared implicitly by all branches, then it's rather easier to begin traditional refactoring methods.
The #ifdefs are primarily better for sections where the affected code only makes sense in the context of other per-branch customizations. One may argue that these also present an opportunity for the same branch-merging scheme, but don't go hog-wild. One colossal project at a time, please.
In the short run, the file will appear to grow. This is OK. What you're doing is bringing things together that need to be together. Afterwards, you'll begin to see areas that are clearly the same regardless of version; these can be left alone or refactored at will. Other areas will clearly differ depending on the version. You have a number of options in this case. One method is to delegate the differences to per-version strategy objects. Another is to derive client versions from a common abstract class. But none of these transformations are possible as long as you have ten "tips" of development in different branches.
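A sketch of the "per-version strategy objects" option, with pricing as an arbitrary example of behavior that differs per client. All class names here are hypothetical.
```cpp
#include <memory>
#include <utility>

// The part that differs between client versions, behind one interface.
class PricingStrategy {
public:
    virtual ~PricingStrategy() {}
    virtual int price_cents(int base_cents) const = 0;
};

class StandardPricing : public PricingStrategy {
public:
    int price_cents(int base_cents) const override { return base_cents; }
};

// Hypothetical client-specific rule: a flat 10% discount.
class ClientXPricing : public PricingStrategy {
public:
    int price_cents(int base_cents) const override {
        return base_cents - base_cents / 10;
    }
};

// The shared code is written once and delegates the difference.
class OrderProcessor {
public:
    explicit OrderProcessor(std::unique_ptr<PricingStrategy> pricing)
        : pricing_(std::move(pricing)) {}
    int total_cents(int base_cents, int quantity) const {
        return pricing_->price_cents(base_cents) * quantity;
    }
private:
    std::unique_ptr<PricingStrategy> pricing_;
};
```
Each client build (or configuration) then differs only in which strategy it constructs.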

I don't know if this solves your problem, but what I guess you want to do is migrate the content of the file to smaller files independent of each other (summed up).
What I also get is that you have about 10 different versions of the software floating around and you need to support them all without messing things up.
First of all, there is just no way that this is easy and will solve itself in a few minutes of brainstorming. The functions linked in your file are all vital to your application, and simply cutting them off and migrating them to other files won't solve your problem.
I think you only have these options:
Don't migrate and stay with what you have. Possibly quit your job and start working on serious software with good design in addition. Extreme programming is not always the best solution if you are working on a long time project with enough funds to survive a crash or two.
Work out a layout of how you would love your file to look once it's split up. Create the necessary files and integrate them in your application. Rename the functions or overload them to take an additional parameter (maybe just a simple boolean?).
Once you have to work on your code, migrate the functions you need to work on to the new file and map the function calls of the old functions to the new functions.
You should still have your main-file this way, and still be able to see the changes that were made to it, once it comes to a specific function you know exactly when it was outsourced and so on.
Try to convince your co-workers with some good cake that workflow is overrated and that you need to rewrite some parts of the application in order to do serious business.

Exactly this problem is handled in one of the chapters of the book "Working Effectively with Legacy Code" (http://www.amazon.com/Working-Effectively-Legacy-Michael-Feathers/dp/0131177052).

I think you would be best off creating a set of command classes that map to the API points of the mainmodule.cpp.
Once they are in place, you will need to refactor the existing code base to access these API points via the command classes. Once that's done, you are free to refactor each command's implementation into a new class structure.
Of course, with a single class of 11 KLOC the code in there is probably highly coupled and brittle, but creating individual command classes will help much more than any other proxy/facade strategy.
I don't envy the task, but as time goes on this problem will only get worse if it's not tackled.
Update
I'd suggest that the Command pattern is preferable to a Facade.
Maintaining and organizing a lot of small Command classes is preferable to maintaining one (relatively) monolithic Facade. A single Facade mapped onto an 11 KLOC file will probably need to be broken up into a few different groups itself.
Why bother trying to figure out these facade groups? With the Command pattern you will be able to group and organise these small classes organically, so you have a lot more flexibility.
Of course, both options are better than the single 11 KLOC, and growing, file.
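A rough sketch of what one such command class might look like, assuming a made-up `MainModule` class with a single API point `saveDocument`. Every name here is hypothetical; the real API points would come from mainmodule.cpp.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Stand-in for the existing monolithic class (hypothetical API).
class MainModule {
public:
    bool saveDocument(const std::string& name) {
        lastSaved_ = name;
        return !name.empty();
    }
    const std::string& lastSaved() const { return lastSaved_; }

private:
    std::string lastSaved_;
};

// One command class per API point. Callers use this instead of MainModule.
class SaveDocumentCommand {
public:
    SaveDocumentCommand(MainModule& legacy, std::string name)
        : legacy_(legacy), name_(std::move(name)) {}

    // Step 1: simply forward to the legacy code.
    // Step 2 (later): move the implementation in here, out of MainModule.
    bool execute() { return legacy_.saveDocument(name_); }

private:
    MainModule& legacy_;
    std::string name_;
};
```

Once every caller goes through `SaveDocumentCommand`, the body of `execute()` can change without touching those callers.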

One important piece of advice: do not mix refactoring and bugfixes. What you want is a version of your program that is identical to the previous version, except that the source code is different.
One way could be to start by splitting the smallest function or part into its own file and then including it with a header (thus turning main.cpp into a list of #includes, which sounds like a code smell in itself, though I'm not a C++ guru), but at least it's now split into files.
You could then try to switch all maintenance releases over to the "new" main.cpp or whatever your structure is. Again: No other changes or Bugfixes because tracking those is confusing as hell.
Another thing: As much as you may desire making one big pass at refactoring the whole thing in one go, you might bite off more than you can chew. Maybe just pick one or two "parts", get them into all the releases, then add some more value for your customer (after all, Refactoring does not add direct value so it is a cost that has to be justified) and then pick another one or two parts.
Obviously that requires some discipline in the team to actually use the split files and not just add new stuff to the main.cpp all the time, but again, trying to do one massive refactor may not be the best course of action.

Rofl, this reminds me of my old job. It seems that, before I joined, everything was inside one huge file (also C++). Then they split it up (at completely random points, using includes) into about three (still huge) files. The quality of this software was, as you might expect, horrible. The project totaled about 40k LOC, containing almost no comments but LOTS of duplicate code.
In the end I did a complete rewrite of the project. I started by redoing the worst part of the project from scratch. Of course I had in mind a possible (small) interface between this new part and the rest. Then I inserted this part into the old project. I didn't refactor the old code to create the necessary interface, but just replaced it. Then I took small steps from there, rewriting the old code.
I have to say that this took about half a year and there was no development of the old code base beside bugfixes during that time.
edit:
The size stayed at about 40k LOC, but the new application contained many more features and presumably fewer bugs in its initial version than the 8-year-old software. One reason for the rewrite was also that we needed the new features, and introducing them inside the old code was nearly impossible.
The software was for an embedded system, a label printer.
Another point that I should add is that in theory the project was C++. But it wasn't OO at all, it could have been C. The new version was object oriented.

OK, so for the most part rewriting the API of production code is a bad idea as a start. Two things need to happen.
One, you need to actually have your team decide to do a code freeze on the current production version of this file.
Two, you need to take this production version and create a branch that manages the builds using preprocessing directives to split up the big file. Splitting the compilation using JUST preprocessor directives (#ifdefs, #includes, #endifs) is easier than recoding the API. It's definitely easier for your SLAs and ongoing support.
Here you could simply cut out functions that relate to a particular subsystem within the class and put them in a file say mainloop_foostuff.cpp and include it in mainloop.cpp at the right location.
OR
A more time consuming but robust way would be to devise an internal dependencies structure with double-indirection in how things get included. This will allow you to split things up and still take care of co-dependencies. Note that this approach requires positional coding and therefore should be coupled with appropriate comments.
This approach would include components that get used based on which variant you are compiling.
The basic structure is that your mainclass.cpp will include a new file called MainClassComponents.cpp after a block of statements like the following:
#if VARIANT == 1
# define Uses_Component_1
# define Uses_Component_2
#elif VARIANT == 2
# define Uses_Component_1
# define Uses_Component_3
# define Uses_Component_6
...
#endif
#include "MainClassComponents.cpp"
The primary structure of the MainClassComponents.cpp file would be there to work out dependencies within the sub components like this:
#ifndef MAINCLASSCOMPONENTS_CPP
#define MAINCLASSCOMPONENTS_CPP
/* dependencies declarations */
#if defined(Uses_Component_1)
#define REQUIRES_COMPONENT_1
#define REQUIRES_COMPONENT_3 /* you also need component 3 for component 1 */
#endif
#if defined(Uses_Component_2)
#define REQUIRES_COMPONENT_2
#define REQUIRES_COMPONENT_15 /* you also need component 15 for this component */
#endif
/* later on in the file */
#ifdef REQUIRES_COMPONENT_1
#include "component_1.cpp"
#endif
#ifdef REQUIRES_COMPONENT_2
#include "component_2.cpp"
#endif
#ifdef REQUIRES_COMPONENT_3
#include "component_3.cpp"
#endif
#endif /* MAINCLASSCOMPONENTS_CPP */
And now for each component you create a component_xx.cpp file.
Of course I am using numbers, but you should use something more logical based on your code.
Using the preprocessor allows you to split things up without having to worry about API changes, which are a nightmare in production.
Once you have production settled you can then actually work on redesign.

Well I understand your pain :) I've been in a few such projects as well and it's not pretty. There is no easy answer for this.
One approach that may work for you is to start adding safeguards to all functions, that is, checking arguments and pre/post-conditions in methods, and then eventually adding unit tests, all in order to capture the current functionality of the sources. Once you have this you are better equipped to refactor the code, because you will have asserts and errors popping up to alert you if you have forgotten something.
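As an illustration only, here is roughly what such guards and a behaviour-capturing test could look like. `computeDiscount` is a hypothetical stand-in for one of the legacy functions; the values in the test simply record what this sketch does today.

```cpp
#include <cassert>

// Hypothetical legacy function, now with pre/post-condition guards added.
int computeDiscount(int priceCents, int percent) {
    // Preconditions: write down what callers are assumed to pass today.
    assert(priceCents >= 0);
    assert(percent >= 0 && percent <= 100);

    int discounted = priceCents - (priceCents * percent) / 100;

    // Postcondition: a discount never increases the price.
    assert(discounted >= 0 && discounted <= priceCents);
    return discounted;
}

// Characterization test: pins down what the code does now,
// not what it ought to do.
void characterizeComputeDiscount() {
    assert(computeDiscount(1000, 0) == 1000);
    assert(computeDiscount(1000, 25) == 750);
    assert(computeDiscount(999, 10) == 900);  // integer division truncates
}
```

If a later refactoring changes any of these results, the assert fires and you know the behaviour moved, whether or not that was intended.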
Sometimes, though, refactoring may just bring more pain than benefit. Then it may be better to leave the original project in a pseudo-maintenance state, start from scratch, and incrementally add the functionality from the beast.

You should not be concerned with reducing the file size, but rather with reducing the class size. It comes down to almost the same thing, but makes you look at the problem from a different angle (as @Brian Rasmussen suggests, your class seems to have too many responsibilities).

What you have is a classic example of a known design antipattern called the blob. Take some time to read the article I point to here; you may find something useful. Besides, if this project is as big as it looks, you should consider some design to prevent it from growing into code that you can't control.

This isn't an answer to the big problem, but a theoretical solution to a specific piece of it:
Figure out where you want to split the big file into subfiles. Put comments in some special format at each of those points.
Write a fairly trivial script that will break the file apart into subfiles at those points. (Perhaps the special comments have embedded filenames that the script can use as instructions for how to split it.) It should preserve the comments as part of the splitting.
Run the script. Delete the original file.
When you need to merge from a branch, first recreate the big file by concatenating the pieces back together, do the merge, and then re-split it.
Also, if you want to preserve the SCC file history, I expect the best way to do that is to tell your source control system that the individual piece files are copies of the original. Then it will preserve the history of the sections that were kept in that file, although of course it will also record that large parts were "deleted".
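The split/re-join steps above can be sketched as follows. This is only an assumption about how such a script might work: the marker format `//@@split <filename>` is invented, the functions operate on in-memory strings (reading and writing the actual files is left out), and the input is assumed to end with a newline.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Assumed marker format: a line of the form "//@@split <filename>".
const std::string kMarker = "//@@split ";

using Piece = std::pair<std::string, std::string>;  // (filename, content)

// Splits `text` at marker lines. Each marker line stays at the top of its
// piece, so joining the pieces reproduces the input.
std::vector<Piece> splitAtMarkers(const std::string& text,
                                  const std::string& firstName) {
    std::vector<Piece> pieces;
    pieces.emplace_back(firstName, "");  // text before the first marker
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, kMarker.size(), kMarker) == 0) {
            // Start a new piece named after the rest of the marker line.
            pieces.emplace_back(line.substr(kMarker.size()), "");
        }
        pieces.back().second += line + "\n";
    }
    return pieces;
}

// Recreates the big file, e.g. before merging from a branch.
std::string joinPieces(const std::vector<Piece>& pieces) {
    std::string out;
    for (const auto& piece : pieces) out += piece.second;
    return out;
}
```

Because the markers are kept inside the pieces, the round trip split, join, merge, split can be repeated as often as needed.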

One way to split it without too much danger would be to take a historic look at all the line changes. Are there certain functions that are more stable than others? Hot spots of change if you will.
If a line hasn't been changed in a few years you can probably move it to another file without too much worry. I'd take a look at the source annotated with the last revision that touched a given line and see if there are any functions you could pull out.

Wow, sounds great. I think explaining to your boss that you need a lot of time to refactor the beast is worth a try. If he doesn't agree, quitting is an option.
Anyway, what I suggest is basically throwing out all the implementation and regrouping it into new modules, let's call those "global services". The "main module" would only forward to those services and ANY new code you write will use them instead of the "main module". This should be feasible in a reasonable amount of time (because it's mostly copy and paste), you don't break existing code and you can do it one maintenance version at a time. And if you still have any time left, you can spend it refactoring all old depending modules to also use the global services.
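To make the forwarding idea concrete, here is a small sketch assuming one piece of functionality (logging) has been moved out. `LoggingService` and `MainModule::logMessage` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <string>

// New home for the implementation: a hypothetical "global service".
class LoggingService {
public:
    static LoggingService& instance() {
        static LoggingService service;
        return service;
    }
    void log(const std::string& msg) {
        last_ = msg;
        ++count_;
    }
    int count() const { return count_; }
    const std::string& last() const { return last_; }

private:
    LoggingService() = default;
    std::string last_;
    int count_ = 0;
};

// The old main module keeps its signature but only forwards, so existing
// callers keep compiling and behave the same.
class MainModule {
public:
    void logMessage(const std::string& msg) {
        LoggingService::instance().log(msg);
    }
};
```

New code would call `LoggingService::instance()` directly, while old modules can be switched over whenever there is time.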

Do not ever touch this file and the code again!
Treat it like something you are stuck with. Start writing adapters for the functionality encoded there.
Write new code in different units and talk only to adapters which encapsulate the functionality of the monster.
... if even one of the above is not possible, quit the job and get yourself a new one.

My sympathies - in my previous job I encountered a similar situation with a file that was several times larger than the one you have to deal with. Solution was:
1. Write code to exhaustively test the function in the program in question. Sounds like you won't already have this in hand...
2. Identify some code that can be abstracted out into a helper/utilities class. It need not be big, just something that is not truly part of your 'main' class.
3. Refactor the code identified in 2. into a separate class.
4. Rerun your tests to ensure nothing got broken.
5. When you have time, goto 2. and repeat as required to make the code manageable.
The classes you build in step 3. iterations will likely grow to absorb more code that is appropriate to their newly-clear function.
I could also add:
0: buy Michael Feathers' book on working with legacy code
Unfortunately this type of work is all too common, but my experience is that there is great value in being able to make working but horrid code incrementally less horrid while keeping it working.

Consider ways to rewrite the entire application in a more sensible way. Maybe rewrite a small section of it as a prototype to see if your idea is feasible.
If you've identified a workable solution, refactor the application accordingly.
If all attempts to produce a more rational architecture fail, then at least you know the solution is probably in redefining the program's functionality.

My 0.05 eurocents:
Re-design the whole mess, split it into subsystems taking into account the technical and business requirements (=many parallel maintenance tracks with potentially different codebase for each, there is obviously a need for high modifiability, etc.).
When splitting into subsystems, analyze the places which have most changed and separate those from the unchanging parts. This should show you the trouble-spots. Separate the most changing parts to their own modules (e.g. dll) in such a way that the module API can be kept intact and you don't need to break BC all the time. This way you can deploy different versions of the module for different maintenance branches, if needed, while having the core unchanged.
The redesign will likely need to be a separate project, trying to do it to a moving target will not work.
As for the source code history, my opinion: forget it for the new code. But keep the history somewhere so you can check it, if needed. I bet you won't need it that much after the beginning.
You most likely need to get management buy-in for this project. You can perhaps argue with faster development time, fewer bugs, easier maintenance and less overall chaos. Something along the lines of "Proactively enable the future-proofness and maintenance viability of our critical software assets" :)
This is how I'd start to tackle the problem at least.

Start by adding comments to it, noting where functions are called and whether things can be moved around. This can get things moving. You really need to assess how brittle the code base is. Then move common bits of functionality together. Small changes at a time.

Another book you may find interesting/helpful is Refactoring.

Something I find useful to do (and I'm doing it now although not at the scale you face), is to extract methods as classes (method object refactoring). The methods that differ across your different versions will become different classes which can be injected into a common base to provide the different behaviour you need.
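A small sketch of the method object idea, assuming a long member function that computes an invoice total. `InvoiceTotalCalculation` and its fields are made up for illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Method object: what used to be one long member function of the big class.
// Its former local variables become fields, so the body can be broken into
// small private steps without long parameter lists.
class InvoiceTotalCalculation {
public:
    InvoiceTotalCalculation(std::vector<int> itemCents, int taxPercent)
        : items_(std::move(itemCents)), taxPercent_(taxPercent) {}

    int compute() {
        subtotal_ = 0;  // reset so compute() can be called more than once
        sumItems();
        applyTax();
        return total_;
    }

private:
    void sumItems() {
        for (int cents : items_) subtotal_ += cents;
    }
    void applyTax() { total_ = subtotal_ + subtotal_ * taxPercent_ / 100; }

    std::vector<int> items_;
    int taxPercent_;
    int subtotal_ = 0;
    int total_ = 0;
};
```

A version that needs different behaviour would get its own class with the same `compute()` shape, which the common base can then be handed.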

I found this sentence to be the most interesting part of your post:
> The file is used and actively changed in several (> 10) maintenance versions of our product and so it is really hard to refactor it
First, I would recommend that you use a source control system that supports branching for developing these 10+ maintenance versions.
Second, I would create ten branches (one for each of your maintenance versions).
I can feel you cringing already! But either your source control isn't working for your situation because of a lack of features, or it's not being used correctly.
Now to the branch you work on - refactor it as you see fit, safe in the knowledge that you'll not upset the other nine branches of your product.
I would be a bit concerned that you have so much in your main() function.
In any project I write, I use main() only to perform initialization of core objects, like a simulation or application object. These classes are where the real work should go on.
I would also initialize an application logging object in main for use globally throughout the program.
Finally, in main I also add leak detection code in preprocessor blocks that ensure it's only enabled in DEBUG builds. This is all I would add to main(). Main() should be short!
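Something like the following is what I have in mind, as a sketch only. `Logger` and `Application` are hypothetical classes, and the body is shown as `appMain` so it can be called from a test; the real `main()` would just be `return appMain(argc, argv);`.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Hypothetical application-wide logger, created once at startup.
class Logger {
public:
    void info(const std::string& msg) { std::clog << "[info] " << msg << '\n'; }
};

// Hypothetical application object: this is where the real work lives.
class Application {
public:
    explicit Application(Logger& log) : log_(log) {}
    int run() {
        log_.info("application started");
        return 0;  // exit code
    }

private:
    Logger& log_;
};

// Body of main(): initialization only, nothing else.
int appMain(int /*argc*/, char** /*argv*/) {
#ifdef DEBUG
    // Platform-specific leak detection would be switched on here
    // (assumption: DEBUG is defined only for debug builds).
#endif
    Logger log;
    Application app(log);
    return app.run();
}
```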
You say that
> The file basically contains the "main class" (main internal work dispatching and coordination) of our program
It sounds like these two tasks could be split into two separate objects - a co-ordinator and a work dispatcher.
When you split these up, you may mess up your "SCC workflow", but it sounds like adhering stringently to your SCC workflow is causing software maintenance problems. Ditch it now and don't look back, because as soon as you fix it, you'll begin to sleep easy.
If you're not able to make the decision, fight tooth and nail with your manager for it - your application needs to be refactored - and badly by the sounds of it! Don't take no for an answer!

As you've described it, the main issue is diffing pre-split vs post-split, merging in bug fixes, etc. Tool around it. It won't take that long to hardcode a script in Perl, Ruby, etc. to rip out most of the noise from diffing pre-split against a concatenation of post-split. Do whatever's easiest in terms of handling noise:
remove certain lines pre/during concatenation (e.g. include guards)
remove other stuff from the diff output if necessary
You could even make it so whenever there's a checkin, the concatenation runs and you've got something prepared to diff against the single-file versions.

"The file basically contains the "main class" (main internal work dispatching and coordination) of our program, so every time a feature is added, it also affects this file and every time it grows."
If that big SWITCH (which I think there is) becomes the main maintenance problem, you could refactor it to use a dictionary (map) and the Command pattern, moving all switch logic out of the existing code and into a loader that populates that map, i.e.:
// declaration
std::map<ID, ICommand*> dispatchTable;
...
// populating using some loader
dispatchTable[id] = concreteCommand;
...
// using
dispatchTable[id]->Execute();
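A fuller, compilable sketch of the same idea, under some assumptions: `ID` is taken to be an `int`, `AppendCommand` is an invented example command, and the table owns its commands through `std::unique_ptr` instead of raw pointers.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

using ID = int;  // assumption: commands are keyed by an integer id

struct ICommand {
    virtual ~ICommand() = default;
    virtual void Execute() = 0;
};

// Hypothetical command; it records its effect so it can be observed.
struct AppendCommand : ICommand {
    AppendCommand(std::string& out, std::string text)
        : out_(out), text_(std::move(text)) {}
    void Execute() override { out_ += text_; }

    std::string& out_;
    std::string text_;
};

class Dispatcher {
public:
    // The loader calls this once per command at startup.
    void registerCommand(ID id, std::unique_ptr<ICommand> cmd) {
        table_[id] = std::move(cmd);
    }

    // Replaces `switch (id) { case ...: }`. Returns false for unknown ids
    // instead of dereferencing a missing entry.
    bool dispatch(ID id) {
        auto it = table_.find(id);
        if (it == table_.end()) return false;
        it->second->Execute();
        return true;
    }

private:
    std::map<ID, std::unique_ptr<ICommand>> table_;
};
```

Note the use of `find` here: `std::map::operator[]` inserts a null entry for an unknown id, so calling `dispatchTable[id]->Execute()` on an id nobody registered would dereference a null pointer.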

I think the easiest way to track the history of source when splitting a file would be something like this:
Make copies of the original source code, using whatever history-preserving copy commands your SCM system provides. You'll probably need to submit at this point, but there's no need yet to tell your build system about the new files, so that should be ok.
Delete code from these copies. That should not break the history for the lines you keep.

I think what I would do in this situation is bite the bullet and:
Figure out how I wanted to split the file up (based on the current development version)
Put an administrative lock on the file ("Nobody touch mainmodule.cpp after 5pm Friday!!!")
Spend your long weekend applying that change to the >10 maintenance versions (from oldest to newest), up to and including the current version.
Delete mainmodule.cpp from all supported versions of the software. It's a new Age - there is no more mainmodule.cpp.
Convince Management that you shouldn't be supporting more than one maintenance version of the software (at least without a big $$$ support contract). If each of your customers have their own unique version.... yeeeeeshhhh. I'd be adding compiler directives rather than trying to maintain 10+ forks.
Tracking old changes to the file is simply solved by your first check-in comment saying something like "split from mainmodule.cpp". If you need to go back to something recent, most people will remember the change; if it's 2 years from now, the comment will tell them where to look. Of course, how valuable will it be to go back more than 2 years to look at who changed the code and why?
