Hardcoding Parameters vs. loading from a file - c++

I am working on a motion control system, and will have at least 5 motors, each with parameters such as "gearbox ratio", "ticks per rev", "Kp", "Ki", "Kd", etc. that will be referenced upon construction of instances of the motors.
My question to StackOverflow is how should I organize these numbers? I know this is likely a preferential thing, but being new to coding I figure I could get some good opinions from you.
The three approaches I immediately see are as follows:
Write in the call to the constructor, either via variables or numbers-- PROS: limited coding, could be implemented in a way that it's easy to change, but possibly harder than #define's
Use #define's to accomplish similar to above -- PROS: least coding, easy to change (assuming you want to look at the source code)
Load a file (possibly named "motorparameters.txt") and load the parameters into an array and populate from that array. If I really wanted to I could add a GUI approach to changing this file rather than manual. -- PROS:easiest to change without diving into source code.
These parameters could change over time, and while there are other coders at the company, I would like to leave it in a way that's easy to configure. Do any of you see a particular benefit of #define vs. variables? I have a "constants.h" file already that I could easily add the #defines to, or I could add variables near the call to the constructor.

There's a principle known as YAGNI (You Ain't Gonna Need It) which says do the simplest thing first, then change it when (if) your requirements expand.

Sounds to me like the thing to do is:
Write a flexible motor class, that can handle any values (within reason), even though there are only 5 different sets of values you currently care about.
Define a component that returns the "right" values for the 5 motors in your system (or that constructs the 5 motors for your system using the "right values")
Initially implement that component to use some hard-coded values out of a header file
Retain the option to replace that component in future with an implementation of the same API, but that reads values out of a resource file, text file, XML file, GUI interaction with the user, off the internet, by making queries to the hardware to find out what motors it thinks it has, whatever.
I say this on the basis that you minimize expected effort by putting in a point of customizability where you suspect you'll want one (to prevent a lot of work when you change it later), but implement using the simplest thing that satisfies your current certain requirements.
Some people might say that it's not actually worth doing the typing (a) to define the component, better just to construct 5 motors in main() (b) to use constants from a header file, better just to type numeric literals in main(). The (b) people are widely despised as peddlers of "magic constants" (which doesn't mean they're necessarily wrong about relative total programming time by implementer and future maintainers, they just probably are) and the (a) people divide opinion. I tend to figure that defining this kind of thing takes a few minutes, so I don't really care whether it's worth it or not. Loading the values out of a file involves specifying a file format that I might regret as soon as I encounter a real reason to customize, so personally I can't be bothered with that until the requirement arises.
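A minimal sketch of the component described above, assuming a hypothetical MotorParams struct and a MotorConfigSource interface (neither name comes from the question, and the numbers are placeholders). The motor class accepts any values; only the config source knows where they come from.

```cpp
#include <array>
#include <cstddef>

// Hypothetical parameter set; the fields follow the examples in the question.
struct MotorParams {
    double gearboxRatio;
    int ticksPerRev;
    double kp, ki, kd;
};

constexpr std::size_t kMotorCount = 5;

// The replaceable component: the only place that knows where values come from.
class MotorConfigSource {
public:
    virtual ~MotorConfigSource() = default;
    virtual MotorParams paramsFor(std::size_t motorIndex) const = 0;
};

// First implementation: hard-coded values. The numbers are placeholders.
class HardCodedMotorConfig : public MotorConfigSource {
public:
    MotorParams paramsFor(std::size_t motorIndex) const override {
        static const std::array<MotorParams, kMotorCount> table{{
            {50.0, 2048, 1.0, 0.10, 0.01},
            {50.0, 2048, 1.2, 0.10, 0.01},
            {30.0, 1024, 0.8, 0.05, 0.00},
            {30.0, 1024, 0.8, 0.05, 0.00},
            {10.0, 1024, 2.0, 0.20, 0.02},
        }};
        return table.at(motorIndex); // throws std::out_of_range on a bad index
    }
};

// Flexible motor class: works with any parameter values it is given.
class Motor {
public:
    explicit Motor(const MotorParams& params) : params_(params) {}
    double ticksPerOutputRev() const {
        return params_.gearboxRatio * params_.ticksPerRev;
    }
private:
    MotorParams params_;
};
```

A later file-backed or GUI-backed source would derive from MotorConfigSource in the same way, and the code that constructs the motors would not need to change.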

The general idea is to separate the portions of your code that will change from those that won't. The more likely something is to change, the more you need to make it easy to change.
If you're building a commercial app where hundreds or thousands of users will use many different motors, it might make sense to code up a UI and store the data in a config file.
If this is development code and these parameters are unlikely to change, then stuffing them into #defines in your constants.h file is probably the way to go.

Number 3 is a great option if you don't have security or IP concerns. Anytime you or someone else touches your code, you introduce the possibility of regressions. By keeping your parameters in a text file, not only are you making life easier on yourself, you're also reducing the scope of possible errors down the road.
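As a rough sketch of option 3, the function below reads "key = value" lines from any input stream. The file format (spaces around "=", "#" for comments) and the name parseMotorParameters are assumptions made for this example, not something the question specifies.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Parses lines of the form "key = value" (spaces around '=' are required).
// Blank lines, lines starting with '#', and malformed lines are skipped.
std::map<std::string, double> parseMotorParameters(std::istream& in) {
    std::map<std::string, double> values;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string key;
        char equals = '\0';
        double value = 0.0;
        if (!(fields >> key) || key[0] == '#') {
            continue; // blank line or comment
        }
        if ((fields >> equals >> value) && equals == '=') {
            values[key] = value;
        }
    }
    return values;
}
```

With a real file, the caller would pass a std::ifstream opened on something like "motorparameters.txt"; taking a std::istream keeps the parser testable without touching the disk.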

Related

Should method names be easy to remember?

Are there any official C++ recommendations that concern with the amount of information that should be disclosed in a method name? I am asking because I can find plenty of references in Internet but none that really explains this.
I'm working on a C++ class with a method called calculateIBANAndBICAndSaveRecordChainIfChanged, which pretty well explains what the method does. A shorter name would be easier to remember and would need no intellisense or copy & paste to type. It would be less descriptive, true, but functionality is supposed to be documented.
calculateIBANAndBICAndSaveRecordChainIfChanged is a bad function name because it breaks the rule that one function does one thing.
Reduce complexity
The single most important reason to create a routine is to reduce a program's complexity. Create a routine to hide information so that you won't need to think about it. Sure, you'll need to think about it when you write the routine. But after it's written, you should be able to forget the details and use the routine without any knowledge of its internal workings. Other reasons to create routines—minimizing code size, improving maintainability, and improving correctness—are also good reasons, but without the abstractive power of routines, complex programs would be impossible to manage intellectually.
You could simply break this function into the functions below:
CalculateIBAN
CalculateBIC
SaveRecordChain
IsRecordChainChanged
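The split could look roughly like this. The RecordChain fields and the RefreshAndSave caller are hypothetical, and the IBAN and BIC bodies are placeholders rather than real calculations; the point is only the shape of the decomposition.

```cpp
#include <string>

// Hypothetical record type; the real fields are not given in the question.
struct RecordChain {
    std::string countryCode;
    std::string bankCode;
    std::string accountNumber;
    std::string iban;
    std::string bic;
    bool saved = false;
};

// Placeholder only: a real IBAN includes check digits.
std::string CalculateIBAN(const RecordChain& r) {
    return r.countryCode + r.bankCode + r.accountNumber;
}

// Placeholder only: a real BIC is looked up, not concatenated.
std::string CalculateBIC(const RecordChain& r) {
    return r.bankCode + r.countryCode;
}

bool IsRecordChainChanged(const RecordChain& r,
                          const std::string& iban,
                          const std::string& bic) {
    return r.iban != iban || r.bic != bic;
}

// Stands in for real persistence.
void SaveRecordChain(RecordChain& r) { r.saved = true; }

// The caller composes the single-purpose steps; returns true if it saved.
bool RefreshAndSave(RecordChain& r) {
    const std::string iban = CalculateIBAN(r);
    const std::string bic = CalculateBIC(r);
    if (!IsRecordChainChanged(r, iban, bic)) {
        return false;
    }
    r.iban = iban;
    r.bic = bic;
    SaveRecordChain(r);
    return true;
}
```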
To name a procedure, use a strong verb followed by an object
A procedure with functional cohesion usually performs an operation on an object. The name should reflect what the procedure does, and an operation on an object implies a verb-plus-object name. PrintDocument(), CalcMonthlyRevenues(), CheckOrderInfo(), and RepaginateDocument() are samples of good procedure names.
Describe everything the routine does
In the routine's name, describe all the outputs and side effects. If a routine computes report totals and opens an output file, ComputeReportTotals() is not an adequate name for the routine. ComputeReportTotalsAndOpenOutputFile() is an adequate name but is too long and silly. If you have routines with side effects, you'll have many long, silly names. The cure is not to use less-descriptive routine names; the cure is to program so that you cause things to happen directly rather than with side effects.
Avoid meaningless, vague, or wishy-washy verbs
Some verbs are elastic, stretched to cover just about any meaning. Routine names like HandleCalculation(), PerformServices(), OutputUser(), ProcessInput(), and DealWithOutput() don't tell you what the routines do. At the most, these names tell you that the routines have something to do with calculations, services, users, input, and output. The exception would be when the verb "handle" was used in the specific technical sense of handling an event.
Most of the points above are taken from Code Complete, 2nd edition. Other good books are Clean Code and The Clean Coder by Robert C. Martin.
To answer the direct question, I don't think function names need to be memorable. It's nice if they are, but like you say this stuff is supposed to be documented. I can look it up.
calculateIBANAndBICAndSaveRecordChainIfChanged is too long for my taste. Aside from the inconvenience of having to c/p or auto-complete to even use them, my fear with long function names is that I don't read them properly either, so names with similar "shapes" start to look confusingly similar to one another.
So I would advise looking for a shorter name. There must be some reason why these operations (calculating two things, and conditionally saving a record chain) have been grouped together. That reason isn't described in the question, it lies somewhere in the specification or the history of your project. You should identify that reason and look to it for a more succinct function name.
When naming a function you can also consider what reasons[*] the function might change in future. Why are there two things (IBAN and BIC) that are calculated at the same time? What is the relationship between them? Can you identify the reason for doing both at once and then saving?
For example: they are the "acronyms" for this object, it's common to want to recalculate the acronyms all at once, and if you recalculate then naturally the changes need saving. Then call the function refreshAcronyms. Maybe there will be a third acronym in future.
For another example: what callers really want is to save the object if changed, and it's an additional chore that, to preserve the integrity of the stored data, I must always recalculate the IBAN and the BIC before saving. In that case, all the rest is a necessary precursor to saving, so I can call the function saveRecordChain. Users of the public interface just need to know that the save function does what needs to be done. There might be a serializeToFile() function in the private interface that saves if changed without doing the extra stuff.
[*] I say "reasons" plural, but Robert C Martin defines the "single responsibility principle" to be that there is only one possible reason to change a well-designed function.
Ideally one method should do only one thing, and its name should reflect that one thing. Only then does your program become readable.
It's a matter of personal preference, although I would think that calculateIBANAndBICAndSaveRecordChainIfChanged is too long and therefore difficult to read and code with (unless you're using a smart editor that can auto-complete).
Two further points:
The function needs to be broken down into smaller parts, as other posters have suggested.
There's no law against commenting your headers to give a more detailed description of the function there so you don't have to build every aspect of its functionality into the name.
You read and write too many methods over the course of your career to remember their names. Most programmers would need to look up the name of a function from their language's standard library, let alone the names of functions that they or their team developed! The most memorable function name would be of no use to someone maintaining your code and seeing the call for the first time. Moreover, chances are good that in six months you wouldn't remember it either!
That is why I recommend going for descriptive names first, and not worrying about the ease of memorization: after all, IDEs with intellisense are not going away any time soon (and they were introduced for a good reason - to address our memory limitations).
For personal use a memorable name would be enough, and useful, but after completing the app you should still refactor every function name to say exactly what the function is intended to do. And if you are working in a group or a company, make sure that each function name reflects its functionality.
For your example, I might name the function something like saveRecordWithRespectToIBANandBIC().

Efficient ways to save and load data from C++ simulation

I would like to know which are the best way to save and load C++ data.
I am mostly interested in saving classes and matrices (not sparse) I use in my simulations.
For now I just save them as txt files, but if I add a member to a class I then have to modify the function that loads the data (it has to parse and check for the value in the txt file), which I think is not ideal.
What would you recommend in general? (p.s. as I'd like to release my code I'd really like to use only standard c++ or libraries that can be redistributed).
In this case, there is no "best." What is best for you is highly dependent upon your situation. But let's have an example to get you thinking about your details and how deep this rabbit hole can go.
If you absolutely positively must have the fastest save possible without question (and you're willing to pay the price), you can define your own memory management to put all objects into a contiguous array of a common type (such as integers). This allows you to write that array to disk as binary data very rapidly. You might need this in a simulation that uses threads efficiently to load every core/processor to run at real time.
Why is this a rather horrible solution? Because it takes a LOT of work and runs many risks for problems in the name of "optimization."
It requires you to build your own memory management (operator new() and operator delete()) which may need to be thread safe.
If you try to load from this array, you will have to placement new all objects with a unique non-modifying constructor in order to ensure all virtual pointers are set properly. Oh, and you have to track the type of each address to know how to do this.
For portability with other systems and between versions of the binary, you will need to have utilities to convert from the binary format to something generic enough to be cross platform (including repopulating pointers to other objects).
I have done this. It was highly unpleasant. I have no doubt there are still problems with it and I have only listed a few here. But, it was very, very fast and very, very, very problematic.
You must design to your needs. Generally, the first need is "Make it work." Don't care about efficiency; just make sure the data is persisted accurately and that the information needed to do so is known and accessible at the point where you save. Also, you should encapsulate the process of saving and loading. Then, if the need "Make it better" steps in, you should be able to change that one bit of code and the rest should work. You might even make the saving format selectable according to the user's needs, rather than according to your own needs, which you would otherwise have to assume hold for all users.
Given all the assumptions, pros and cons listed, you should be able to elaborate your particular needs for this question.
Given that performance is not your concern -- which is a critical part of the answer -- the Boost Serialization library is a great answer.
The link in the comment leads to the documentation. Read the tutorial (which is overkill for what you are initially wanting, but well worth it).
Finally, since you have mostly array matrices, try to encapsulate the entire process of save and load so that, should you need to change it later, you are writing a new implementation and choosing between it and the existing one. I expect the added time for the smarts of Boost Serialization would not be great; however, you might find a future requirement moves you to something else or multiple something elses.
The C++ Middleware Writer automates the creation of marshalling functions. When you add a member to a class, it updates the marshalling functions for you.

Global variables (again)

I keep hearing that global variables should never be used, but I have a tendency to dismiss "never" rules as hot-headed. Are there really no exceptions?
For instance, I am currently writing a small game in C++ with SDL. It seems to me to make a lot of sense to have a global variable with a pointer to the screen buffer, because all the different classes that represent the different types of things in the game will need to blit to it, and there is only one screen buffer.
Please tell me if I am right that there are exceptions, or if not then:
Why not, or what is so bad about them that they should be avoided at all costs (please explain a little bit)
How can this be achieved, preferably without having to pass it to every constructor to be stored internally until needed, or to every call to a paint() request.
(I would assume that this question had been asked on SO before, however couldn't find what I need (an explanation and workaround) when searching. If someone could just post a link to a previous question, that could be great)
We tell students never to use global variables because it encourages better programming methods. It's the same reason we tell them not to use a goto statement. Once you're an accomplished programmer then you can break the rules because you should know when it's appropriate.
Of course there are exceptions. I personally can't think of a single situation where a goto is the right solution (or where a singleton is the right solution), but global variables occasionally have their uses. But... you haven't found a valid excuse.
Most objects in your game do not, repeat, not need to access the screen buffer. That is the responsibility of the renderer and no one else. You don't want your logger, input manager, AI or anyone else putting random garbage on the screen.
And that is why people say "don't use globals". It's not because globals are some kind of ultimate evil, but because if we don't say this, people fall into the trap you're in, of "yeah but that rule doesn't apply to me, right? I need everything to have access to X". No, you need to learn to structure your program.
More common exceptions are for state-less or static objects, like a logger, or perhaps your app's configuration: things that are either read-only or write-only, and which truly need to be accessible from everywhere. Every line of code may potentially need to write a log message. So a logger is a fair candidate for making global. But 99% of your code should not even need to know that a screen buffer exists.
The problem with globals is, in a nutshell, that they violate encapsulation:
Code that depends on a global is less reusable. I can take the exact same class you're using, put it in my app, and it'll break. Because I don't have the same network of global objects that it depends on.
It also makes the code harder to reason about. What value will a function f(x) return?
It obviously depends on what x is. But if I pass the same x twice, will I get the same result? If it uses a lot of globals, then probably not. Then it becomes really difficult to just figure out what it's going to return, and also what else it is going to do. Is it going to set some global variable that's going to affect other, seemingly unrelated, functions?
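A small illustration of that point, using a hypothetical global gScale and two hypothetical functions. The first one can return different results for the same argument; the second cannot.

```cpp
// Hypothetical global, for illustration only.
double gScale = 1.0;

// The result depends on hidden state: the same x can give different answers.
double scaledWithGlobal(double x) {
    return x * gScale;
}

// Everything the result depends on is visible in the signature.
double scaled(double x, double scale) {
    return x * scale;
}
```

Any code anywhere that assigns to gScale silently changes what scaledWithGlobal(2.0) returns, while scaled(2.0, 3.0) is always 6.0.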
How can this be achieved, preferably without having to pass it to every constructor to be stored internally until needed
You make it sound like that's a bad thing. If an object needs to know about the screen buffer, then you should give it the screen buffer. Either in the constructor, or in a later call. (And it has a nice bonus: it alerts you if your design is sloppy. If you have 500 classes that need to use the screen buffer, then you have to pass it to 500 constructors. That's painful, and so it's a wake-up call: I am doing something wrong. That many objects shouldn't need to know about the screen buffer. How can I fix this?)
As a more obvious example, say I want to calculate the cosine of 1.42, so I pass 1.42 to the function: cos(1.42)
That's how we usually do it, with no globals. Of course, we could instead say "yeah but everyone needs to be able to set the argument to cos, I'd better make it global". Then it'd look like this:
gVal = 1.42;
cos();
I don't know about you, but I think the first version was more readable.
Like any other design decision, using global variables has a cost. It saves you having to pass variables unnecessarily, and allows you to share state among running functions. But it also has the potential to make your code hard to follow, and to reuse.
Some applications, like embedded systems, use global variables regularly. For them, the added speed of not having to pass the variable or even a pointer into the activation record, and the simplicity, makes it a good decision [arguably]. But their code suffers for it; it is often hard to follow execution and developing systems with increasing complexity becomes more and more difficult.
In a large system, consisting of heterogeneous components, using globals may become a nightmare to maintain. At some point you may need a different screen buffer with different properties, or the screen buffer may not be available until it's initialized meaning you'll have to wrap every call to it with a check if it's null, or you'll need to write multithreaded code, and the global will require a lock.
In short, you are free to use global vars while your application is small enough to manage. When it starts to grow, they will become a liability, and will either require refactoring to remove, or will cripple the program's growth (in terms of capability or stability). The admonition not to use them stems from years of hard-learned lessons, not programmer "hot-headedness".
What if you want to update your engine to support dual screen? Multiple displays are becoming more and more common all the time. Or what if you want to introduce threading? Bang. How about if you want to support more than one rendering subsystem? Whoopsie. I want to pack my code as a library for other people or myself to re-use? Crap.
Another problem is that the order of global init between source files is undefined, making it tricky to maintain more than a couple.
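For the cases where a global really is wanted (such as the logger mentioned earlier), one common way around the initialization-order problem is a function-local static, sometimes called "construct on first use". The answer does not mention this idiom, so treat the sketch below, with its hypothetical Logger class, as an aside.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical logger, used only to illustrate the idiom.
class Logger {
public:
    void log(const std::string& message) { messages_.push_back(message); }
    std::size_t count() const { return messages_.size(); }
private:
    std::vector<std::string> messages_;
};

// A function-local static is initialized the first time control passes
// through its declaration (thread-safe since C++11), so every caller,
// in any source file, gets a fully constructed object.
Logger& logger() {
    static Logger instance;
    return instance;
}
```

This fixes the ordering problem only; it is still shared mutable state, with the other drawbacks discussed in this thread.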
Ultimately, you should have one and only one object that can work with the screen buffer - the rendering object. Thus, the screen buffer pointer should be part of that object.
I agree with you from a fundamental point of view - "never" is inaccurate. Every function call you make is calling a global variable - the address of that function. This is especially true for imported functions like OS functions. There are other things that you simply cannot unglobal, even if you wanted to - like the heap. However, this is most assuredly not the right place to use a global.
The biggest problem with globals is that if you later decide that a global wasn't the right thing to do for any reason (and there are many reasons), then they're absolutely hell to factor out of an existing program. The simple fact is that using a global is just not thinking. I can't be bothered to design an actual rendering subsystem and object, so I'm just gonna chuck this stuff in a global. It's easy, it's simple, and not doing this was the biggest revolution in software programming, ever, and for good reason.
Make a rendering class. Put the pointer in there. Use a member function. Problem solved.
Edit: I re-read your OP. The problem here is that you've split your responsibilities. Each class (bitmap, text, whatever) should NOT render itself. It should just hold the data that the master rendering object needs to render it. It's a Bitmap's job to represent a bitmap - not to render a bitmap.
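A minimal sketch of that split, assuming a hypothetical Renderer that owns a plain pixel buffer (with SDL it would wrap the real surface instead). Bitmap only holds data; Renderer is the only class that touches the buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain data: a Bitmap knows nothing about any screen.
struct Bitmap {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint32_t> pixels; // row-major, width * height entries
};

// Hypothetical stand-in for the rendering object.
class Renderer {
public:
    Renderer(std::size_t width, std::size_t height)
        : width_(width), height_(height), buffer_(width * height, 0) {}

    // Copies the bitmap to position (x, y), clipping at the buffer edges.
    void draw(const Bitmap& bmp, std::size_t x, std::size_t y) {
        for (std::size_t row = 0; row < bmp.height && y + row < height_; ++row) {
            for (std::size_t col = 0; col < bmp.width && x + col < width_; ++col) {
                buffer_[(y + row) * width_ + (x + col)] =
                    bmp.pixels[row * bmp.width + col];
            }
        }
    }

    std::uint32_t pixelAt(std::size_t x, std::size_t y) const {
        return buffer_[y * width_ + x];
    }

private:
    std::size_t width_;
    std::size_t height_;
    std::vector<std::uint32_t> buffer_; // the only place the buffer lives
};
```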
Global variables can change in unexpected ways, which is usually not what you want. The state of the application becomes complex and unmaintainable, and it is very easy to get something wrong, especially if someone else is changing your code.
Singleton might be a better idea. That would at least give you some encapsulation in case you need to make extensions in the future.
One reason not to use global variables is a problem with namespaces (i.e. accidentally using the same name twice).
We do often use global (to namespace) constants at work which is considered normal, as they don't change (in unexpected ways) and it is very convenient to have them available in multiple files.
If the screen buffer is shared between lots of different pieces of code, then you have two options:
1) Pass it around all over the place. This is inconvenient, because every piece of code that uses the screen buffer, even indirectly, needs to be laboriously indicated as such by the fact that this object is passed through the call stack.
2) Use a global. If you do this, then for all you know any function at all in your entire program might use the screen buffer, just by grabbing it from the global[*]. So if you need to reason about the state of the screen buffer, then you need to include the entire program in your reasoning. If only there was some way to indicate which functions modify the screen buffer, and which cannot possibly ever do so. Oh, hang on a second...
This is even aside from the benefits of dependency injection - when testing, and in future iterations of your program, it might be useful for the caller of some function that blits, to be able to say where it should blit to, not necessarily the screen.
The same issues apply as much to singletons as they do to other modifiable globals.
You could perhaps even make a case that it should cost you something to add yet another piece of code that modifies the screen buffer, because you should try to write systems which are loosely coupled, and doing so will naturally result in fairly few pieces of code that need to know anything at all about the screen in order to do their job (even if they know that they're manipulating images, they needn't necessarily care whether those images are in the screen buffer, or some back buffer, or some completely unrelated buffer that's nothing to do with the screen). I'm not actually in favour of making extra work just to punish myself into writing better code, but it's certainly true that globals make it quite easy to add yet another inappropriate wad of coupling to my app.
[*] Well, you may be able to narrow it down on the basis that only TUs that include the relevant header file will have the declaration. There's nothing technically to stop them copy-and-pasting it, but in a code base that's at all well regulated, they won't.
The reason I never do it is because it creates a mess. Imagine setting ALL unique variables to globals: you would have an extern list the size of a phonebook.
Another reason could be that you don't know where it is initialized or modified. What if you accidentally modify it at place X in file Y? You will never know. What if it isn't initialized yet? You will have to check every time.
if (global_var = 0) // uh oh :-(
if (object->Instance() = 0) // compile error :-)
Both of these can be fixed using singletons. You simply can't assign to a function returning the object's address.
Besides that: you don't need your screen buffer everywhere in your application. However, if you want to, go ahead; it doesn't make the program run any worse :-)
And then you still have the namespace problem but that at least gives you compile errors ;-)
"Why not": global variables give you spaghetti information flow.
That's the same as goto gives you spaghetti control flow.
You don't know where anything comes from, or what can be assumed at any point. The INTERCAL solution of introducing a come from statement, while offering some initial hope of finally being sure of where control comes from, turned out to not really solve that problem for goto. Similarly, more modern language features for tracking updates to global variables, like onchangeby, have not turned out to solve that problem for global variables.
Cheers & hth.,
Global variables (and singletons, which are just a wrapper around a global variable) can cause a number of problems, as I discussed in this answer.
For this specific issue -- blittable objects in a game kit -- I'd be apt to suggest a method signature like Sprite::drawOn(Canvas&, const Point&). It shouldn't be excessive overhead to pass a reference to the Canvas around, since it's not likely to be needed except in the paint pathway, and within that pathway you're probably iterating over a collection anyway, so passing it in that loop isn't that hard. By doing this, you're hiding that the main program has only one active screen buffer from the sprite classes, and therefore making it less likely to create a dependency on this fact.
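A sketch of that signature in use. Only Sprite::drawOn(Canvas&, const Point&) comes from the text above; the Canvas here is a hypothetical stand-in that just records where it was asked to draw, and paintAll is a hypothetical paint loop.

```cpp
#include <vector>

struct Point {
    int x;
    int y;
};

// Hypothetical Canvas: records draw positions so the sketch stays self-contained.
class Canvas {
public:
    void plot(const Point& p) { plotted_.push_back(p); }
    const std::vector<Point>& plotted() const { return plotted_; }
private:
    std::vector<Point> plotted_;
};

class Sprite {
public:
    // The sprite draws on whatever canvas it is handed; it never looks one up.
    void drawOn(Canvas& canvas, const Point& at) const { canvas.plot(at); }
};

// The paint pathway already iterates, so passing the canvas along is cheap.
void paintAll(const std::vector<Sprite>& sprites, Canvas& canvas) {
    int x = 0;
    for (const Sprite& s : sprites) {
        s.drawOn(canvas, Point{x, 0});
        x += 10; // arbitrary spacing for the example
    }
}
```

Because the target is a parameter, the same sprites can be painted to an off-screen canvas in a test or to a second display without any change to Sprite.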
Disclaimer: I haven't used SDL itself before, but I wrote a simple cross-platform C++ game kit back in the late '90s. At the time I was working on my game kit, it was fairly common practice for multi-player X11-based games to run as a single process on one machine that opened a connection to each player's display, which would quite efficiently make a mess of code that assumed the screen buffer was a singleton.
In this case a class that provides member functions for all methods that need access to the screen buffer would be a more OOP friendly approach. Why should everyone and any one have uncontrolled access to it!?
As to whether there are times when a global is better or even necessary, probably not. Globals are deceptively attractive when you are hacking out some code, because you need jump through no syntactic hoops to access them, but they are generally indicative of poor design, and of a design that will rapidly atrophy under maintenance and extension.
Here's a good read on the subject (related to embedded programming, but the points apply to any code, it is just that some embedded programmers think they have a valid excuse).
I'm also curious to hear the precise explanation, but I can tell you that the Singleton pattern usually works pretty well to fill the role of global variables.

What to do about a 11000 lines C++ source file?

So we have this huge (is 11000 lines huge?) mainmodule.cpp source file in our project and every time I have to touch it I cringe.
As this file is so central and large, it keeps accumulating more and more code and I can't think of a good way to make it actually start to shrink.
The file is used and actively changed in several (> 10) maintenance versions of our product and so it is really hard to refactor it. If I were to "simply" split it up, say for a start, into 3 files, then merging back changes from maintenance versions will become a nightmare. And also if you split up a file with such a long and rich history, tracking and checking old changes in the SCC history suddenly becomes a lot harder.
The file basically contains the "main class" (main internal work dispatching and coordination) of our program, so every time a feature is added, it also affects this file and every time it grows. :-(
What would you do in this situation? Any ideas on how to move new features to a separate source file without messing up the SCC workflow?
(Note on the tools: We use C++ with Visual Studio; We use AccuRev as SCC but I think the type of SCC doesn't really matter here; We use Araxis Merge to do actual comparison and merging of files)
Merging now will not be as big a nightmare as it will be when you have a 30000 LOC file in the future. So:
Stop adding more code to that file.
Split it.
If you can't just stop coding during refactoring process, you could leave this big file as is for a while at least without adding more code to it: since it contains one "main class" you could inherit from it and keep inherited class(es) with overloaded functions in several new small and well designed files.
Find some code in the file which is relatively stable (not changing fast, and doesn't vary much between branches) and could stand as an independent unit. Move this into its own file, and for that matter into its own class, in all branches. Because it's stable, this won't cause (many) "awkward" merges that have to be applied to a different file from the one they were originally made on, when you merge the change from one branch to another. Repeat.
Find some code in the file which basically only applies to a small number of branches, and could stand alone. Doesn't matter whether it's changing fast or not, because of the small number of branches. Move this into its own classes and files. Repeat.
So, we've got rid of the code that's the same everywhere, and the code that's specific to certain branches.
This leaves you with a nucleus of badly-managed code - it's needed everywhere, but it's different in every branch (and/or it changes constantly so that some branches are running behind others), and yet it's in a single file that you're unsuccessfully trying to merge between branches. Stop doing that. Branch the file permanently, perhaps by renaming it in each branch. It's not "main" any more, it's "main for configuration X". OK, so you lose the ability to apply the same change to multiple branches by merging, but this is in any case the core of code where merging doesn't work very well. If you're having to manually manage the merges anyway to deal with conflicts, then it's no loss to manually apply them independently on each branch.
I think you're wrong to say that the kind of SCC doesn't matter, because for example git's merging abilities are probably better than the merge tool you're using. So the core problem, "merging is difficult" occurs at different times for different SCCs. However, you're unlikely to be able to change SCCs, so the issue is probably irrelevant.
It sounds to me like you're facing a number of code smells here. First of all the main class appears to violate the open/closed principle. It also sounds like it is handling too many responsibilities. Due to this I would assume the code to be more brittle than it needs to be.
While I can understand your concerns regarding traceability following a refactoring, I would expect that this class is rather hard to maintain and enhance and that any changes you do make are likely to cause side effects. I would assume that the cost of these outweighs the cost of refactoring the class.
In any case, since the code smells will only get worse with time, at least at some point the cost of these will outweigh the cost of refactoring. From your description I would assume that you're past the tipping point.
Refactoring this should be done in small steps. If possible add automated tests to verify current behavior before refactoring anything. Then pick out small areas of isolated functionality and extract these as types in order to delegate the responsibility.
In any case, it sounds like a major project, so good luck :)
The only solution I have ever imagined to such problems follows. The actual gain by the described method is progressiveness of the evolutions. No revolutions here, otherwise you'll be in trouble very fast.
Insert a new cpp class above the original main class. For now, it would basically redirect all calls to the current main class, but aim at making the API of this new class as clear and succinct as possible.
Once this has been done, you get the possibility to add new functionalities in new classes.
As for existing functionalities, you have to progressively move them in new classes as they become stable enough. You will lose SCC help for this piece of code, but there is not much that can be done about that. Just pick the right timing.
I know this is not perfect, though I hope it can help, and the process must be adapted to your needs!
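As a rough sketch of that first step, the new class could start out as a thin forwarder. The names LegacyMain, MainFacade, dispatch and runJob are made up for illustration; the real main class will of course look different.

```cpp
#include <cassert>
#include <string>

// Stand-in for the existing oversized main class (hypothetical methods).
class LegacyMain {
public:
    int dispatch(const std::string& job) { return static_cast<int>(job.size()); }
    void setVerbose(bool v) { verbose_ = v; }
    bool verbose() const { return verbose_; }

private:
    bool verbose_ = false;
};

// New class placed "above" the legacy one. For now every call is forwarded
// unchanged; the goal at this stage is only a clear, small API.
class MainFacade {
public:
    explicit MainFacade(LegacyMain& legacy) : legacy_(legacy) {}

    int runJob(const std::string& job) { return legacy_.dispatch(job); }
    void enableLogging() { legacy_.setVerbose(true); }

private:
    LegacyMain& legacy_;  // not owned; the legacy object must outlive the facade
};
```

New callers would talk to MainFacade only, so the implementation behind runJob can later move into new classes without touching them.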
Additional information
Note that Git is an SCC that can follow pieces of code from one file to another. I have heard good things about it, so it could help while you are progressively moving your work.
Git stores each version of a file as a blob and does not record moves explicitly; instead it detects moved or copied code after the fact by comparing content (for example, git blame -C looks for lines that came from other files). So if you move pieces of code into different files, Git can often still find them, even if you modify them slightly. Apart from the video from Linus Torvalds mentioned in comments below, I have not been able to find something clear about this.
Confucius say: "first step to getting out of hole is to stop digging hole."
Let me guess: Ten clients with divergent feature sets and a sales manager that promotes "customization"? I've worked on products like that before. We had essentially the same problem.
You recognize that having an enormous file is trouble, but even more trouble is ten versions that you have to keep "current". That's multiple maintenance. SCC can make that easier, but it can't make it right.
Before you try to break the file into parts, you need to bring the ten branches back in sync with each other so that you can see and shape all the code at once. You can do this one branch at a time, testing both branches against the same main code file. To enforce the custom behavior, you can use #ifdef and friends, but it's better as much as possible to use ordinary if/else against defined constants. This way, your compiler will verify all types and most probably eliminate "dead" object code anyway. (You may want to turn off the warning about dead code, though.)
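A minimal sketch of the if/else approach, assuming a hypothetical CLIENT_ID macro set by the build and an invented rounding rule that differs per client. The point is only that both branches are compiled and type-checked in every build, unlike code hidden behind #ifdef.

```cpp
#include <cassert>

// Hypothetical per-client switch; in a real build it would come from a
// configuration header or a -D compiler flag.
#ifndef CLIENT_ID
#define CLIENT_ID 2
#endif

constexpr int kClientId = CLIENT_ID;
constexpr bool kRoundUp = (kClientId == 2);  // invented customization

// Gross price in cents for a net price and a tax percentage.
int priceInCents(int net, int taxPercent) {
    const int scaled = net * (100 + taxPercent);
    if (kRoundUp) {
        return (scaled + 99) / 100;  // this client rounds up
    } else {
        return scaled / 100;         // everyone else truncates
    }
}
```

Since kRoundUp is a compile-time constant, an optimizing compiler will usually drop the branch that cannot be taken.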
Once there's only one version of that file shared implicitly by all branches, then it's rather easier to begin traditional refactoring methods.
The #ifdefs are primarily better for sections where the affected code only makes sense in the context of other per-branch customizations. One may argue that these also present an opportunity for the same branch-merging scheme, but don't go hog-wild. One colossal project at a time, please.
In the short run, the file will appear to grow. This is OK. What you're doing is bringing things together that need to be together. Afterwards, you'll begin to see areas that are clearly the same regardless of version; these can be left alone or refactored at will. Other areas will clearly differ depending on the version. You have a number of options in this case. One method is to delegate the differences to per-version strategy objects. Another is to derive client versions from a common abstract class. But none of these transformations are possible as long as you have ten "tips" of development in different branches.
I don't know if this solves your problem, but what I guess you want to do is migrate the content of the file to smaller files independent of each other (summed up).
What I also get is that you have about 10 different versions of the software floating around and you need to support them all without messing things up.
First of all, there is just no way that this is easy and will solve itself in a few minutes of brainstorming. The functions linked in your file are all vital to your application, and simply cutting them off and migrating them to other files won't solve your problem.
I think you only have these options:
Don't migrate and stay with what you have. Possibly quit your job and start working on serious software with good design in addition. Extreme programming is not always the best solution if you are working on a long time project with enough funds to survive a crash or two.
Work out a layout of how you would love your file to look once it's split up. Create the necessary files and integrate them in your application. Rename the functions or overload them to take an additional parameter (maybe just a simple boolean?).
Once you have to work on your code, migrate the functions you need to work on to the new file and map the function calls of the old functions to the new functions.
You should still have your main-file this way, and still be able to see the changes that were made to it, once it comes to a specific function you know exactly when it was outsourced and so on.
Try to convince your co-workers with some good cake that workflow is overrated and that you need to rewrite some parts of the application in order to do serious business.
Exactly this problem is handled in one of the chapters of the book "Working Effectively with Legacy Code" (http://www.amazon.com/Working-Effectively-Legacy-Michael-Feathers/dp/0131177052).
I think you would be best off creating a set of command classes that map to the API points of the mainmodule.cpp.
Once they are in place, you will need to refactor the existing code base to access these API points via the command classes, once that's done, you are free to refactor each command's implementation into a new class structure.
Of course, with a single class of 11 KLOC the code in there is probably highly coupled and brittle, but creating individual command classes will help much more than any other proxy/facade strategy.
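To make the idea a bit more concrete, here is a small sketch of one command class wrapping one API point. MainModule, openDocument and OpenDocumentCommand are hypothetical names, not taken from the original code.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Stand-in for the class in mainmodule.cpp (method name is hypothetical).
class MainModule {
public:
    int openDocument(const std::string& path) { return static_cast<int>(path.size()); }
};

// One command class per API point. Callers use the command, not MainModule.
class OpenDocumentCommand {
public:
    OpenDocumentCommand(MainModule& module, std::string path)
        : module_(module), path_(std::move(path)) {}

    // For now this only forwards; later the implementation can move into a
    // new class structure without the callers noticing.
    int execute() { return module_.openDocument(path_); }

private:
    MainModule& module_;  // not owned
    std::string path_;
};
```

Once all callers go through commands like this, each execute() body can be refactored on its own schedule.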
I don't envy the task, but as time goes on this problem will only get worse if it's not tackled.
Update
I'd suggest that the Command pattern is preferable to a Facade.
Maintaining/organizing a lot of different Command classes over a (relatively) monolithic Facade is preferable. Mapping a single Facade onto a 11 KLOC file will probably need to be broken up into a few different groups itself.
Why bother trying to figure out these facade groups? With the Command pattern you will be able to group and organise these small classes organically, so you have a lot more flexibility.
Of course, both options are better than the single 11 KLOC and growing, file.
One important piece of advice: do not mix refactoring and bugfixes. What you want is a version of your program that is identical to the previous version, except that the source code is organized differently.
One way could be to start splitting the smallest function/part off into its own file and then include it with a header (thus turning main.cpp into a list of #includes, which sounds like a code smell in itself, though I'm not a C++ guru), but at least it's now split into files.
You could then try to switch all maintenance releases over to the "new" main.cpp or whatever your structure is. Again: No other changes or Bugfixes because tracking those is confusing as hell.
Another thing: As much as you may desire making one big pass at refactoring the whole thing in one go, you might bite off more than you can chew. Maybe just pick one or two "parts", get them into all the releases, then add some more value for your customer (after all, Refactoring does not add direct value so it is a cost that has to be justified) and then pick another one or two parts.
Obviously that requires some discipline in the team to actually use the split files and not just add new stuff to the main.cpp all the time, but again, trying to do one massive refactor may not be the best course of action.
Rofl, this reminds me of my old job. It seems that, before I joined, everything was inside one huge file (also C++). Then they split it up (at completely random points using includes) into about three (still huge) files. The quality of this software was, as you might expect, horrible. The project totaled about 40k LOC (containing almost no comments but LOTS of duplicate code).
In the end I did a complete rewrite of the project. I started by redoing the worst part of the project from scratch. Of course I had in mind a possible (small) interface between this new part and the rest. Then I inserted this part into the old project. I didn't refactor the old code to create the necessary interface, but just replaced it. Then I took small steps from there, rewriting the old code.
I have to say that this took about half a year and there was no development of the old code base beside bugfixes during that time.
edit:
The size stayed at about 40k LOC, but the new application contained many more features and presumably fewer bugs in its initial version than the 8-year-old software. One reason for the rewrite was also that we needed the new features, and introducing them into the old code was nearly impossible.
The software was for an embedded system, a label printer.
Another point that I should add is that in theory the project was C++. But it wasn't OO at all, it could have been C. The new version was object oriented.
OK so for the most part rewriting API of production code is a bad idea as a start. Two things need to happen.
One, you need to actually have your team decide to do a code freeze on current production version of this file.
Two, you need to take this production version and create a branch that manages the builds using preprocessing directives to split up the big file. Splitting the compilation using JUST preprocessor directives (#ifdefs, #includes, #endifs) is easier than recoding the API. It's definitely easier for your SLAs and ongoing support.
Here you could simply cut out functions that relate to a particular subsystem within the class and put them in a file say mainloop_foostuff.cpp and include it in mainloop.cpp at the right location.
OR
A more time consuming but robust way would be to devise an internal dependencies structure with double-indirection in how things get included. This will allow you to split things up and still take care of co-dependencies. Note that this approach requires positional coding and therefore should be coupled with appropriate comments.
This approach would include components that get used based on which variant you are compiling.
The basic structure is that your mainclass.cpp will include a new file called MainClassComponents.cpp after a block of statements like the following:
#if VARIANT == 1
# define Uses_Component_1
# define Uses_Component_2
#elif VARIANT == 2
# define Uses_Component_1
# define Uses_Component_3
# define Uses_Component_6
...
#endif
#include "MainClassComponents.cpp"
The primary structure of the MainClassComponents.cpp file would be there to work out dependencies within the sub components like this:
#ifndef MAINCLASSCOMPONENTS_CPP
#define MAINCLASSCOMPONENTS_CPP
/* dependencies declarations (keyed on the Uses_Component_N macros defined before the include) */
#if defined(Uses_Component_1)
#define REQUIRES_COMPONENT_1
#define REQUIRES_COMPONENT_3 /* you also need component 3 for component 1 */
#endif
#if defined(Uses_Component_2)
#define REQUIRES_COMPONENT_2
#define REQUIRES_COMPONENT_15 /* you also need component 15 for this component */
#endif
/* later on in the file */
#ifdef REQUIRES_COMPONENT_1
#include "component_1.cpp"
#endif
#ifdef REQUIRES_COMPONENT_2
#include "component_2.cpp"
#endif
#ifdef REQUIRES_COMPONENT_3
#include "component_3.cpp"
#endif
#endif /* MAINCLASSCOMPONENTS_CPP */
And now for each component you create a component_xx.cpp file.
Of course I am using numbers, but you should use something more logical based on your code.
Using preprocessor allows you to split things up without having to worry about API changes which is a nightmare in production.
Once you have production settled you can then actually work on redesign.
Well I understand your pain :) I've been in a few such projects as well and it's not pretty. There is no easy answer for this.
One approach that may work for you is to start adding safe guards in all functions, that is, checking arguments, pre/post-conditions in methods, then eventually adding unit tests all in order to capture the current functionality of the sources. Once you have this you are better equipped to re-factor the code because you will have asserts and errors popping up alerting you if you have forgotten something.
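A small sketch of what such safeguards could look like on a single function. The function averageLoad and its rules (non-empty input, non-negative values) are invented for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical legacy function with guards added around its existing logic.
double averageLoad(const std::vector<double>& samples) {
    // Precondition: reject input the old code silently assumed away.
    if (samples.empty()) {
        throw std::invalid_argument("averageLoad: samples must not be empty");
    }

    double sum = 0.0;
    for (double s : samples) {
        assert(s >= 0.0 && "loads are expected to be non-negative");
        sum += s;
    }
    const double result = sum / static_cast<double>(samples.size());

    // Postcondition: documents what callers currently rely on.
    assert(result >= 0.0);
    return result;
}
```

A unit test that pins the current behavior can then be as simple as checking that averageLoad({1.0, 3.0}) returns 2.0 and that an empty vector throws.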
Sometimes, though, refactoring may just bring more pain than benefit. Then it may be better to leave the original project in a pseudo maintenance state, start from scratch, and incrementally add the functionality from the beast.
You should not be concerned with reducing the file size, but rather with reducing the class size. It comes down to almost the same thing, but makes you look at the problem from a different angle (as @Brian Rasmussen suggests, your class seems to have too many responsibilities).
What you have is a classic example of a known design antipattern called the blob. Take some time to read the article I point to here, and you may find something useful. Besides, if this project is as big as it looks, you should consider some design work to prevent it from growing into code that you can't control.
This isn't an answer to the big problem, but a theoretical solution to a specific piece of it:
Figure out where you want to split the big file into subfiles. Put comments in some special format at each of those points.
Write a fairly trivial script that will break the file apart into subfiles at those points. (Perhaps the special comments have embedded filenames that the script can use as instructions for how to split it.) It should preserve the comments as part of the splitting.
Run the script. Delete the original file.
When you need to merge from a branch, first recreate the big file by concatenating the pieces back together, do the merge, and then re-split it.
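The split and re-join steps could be sketched roughly like this. The marker format "//@@split <filename>" is an assumption made up for this example, and the functions work on strings so that the file I/O is left out.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical marker: a line starting with this prefix begins a new piece,
// and the rest of the line is the target file name.
const std::string kMarker = "//@@split ";

// (file name, content) in original order, so the pieces can be re-joined.
using Pieces = std::vector<std::pair<std::string, std::string>>;

// Splits text at marker lines. Marker lines stay in their piece, so that
// concatenating all pieces recreates the big file.
Pieces splitAtMarkers(const std::string& text, const std::string& firstName) {
    Pieces pieces;
    pieces.emplace_back(firstName, "");
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, kMarker.size(), kMarker) == 0) {
            pieces.emplace_back(line.substr(kMarker.size()), "");
        }
        pieces.back().second += line + "\n";
    }
    return pieces;
}

// Recreates the big file for merging. Round-trips exactly as long as the
// original text ends with a newline.
std::string joinPieces(const Pieces& pieces) {
    std::string text;
    for (const auto& piece : pieces) {
        text += piece.second;
    }
    return text;
}
```

Before a merge you would call joinPieces, merge the result as if it were the old big file, and then run splitAtMarkers again.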
Also, if you want to preserve the SCC file history, I expect the best way to do that is to tell your source control system that the individual piece files are copies of the original. Then it will preserve the history of the sections that were kept in that file, although of course it will also record that large parts were "deleted".
One way to split it without too much danger would be to take a historic look at all the line changes. Are there certain functions that are more stable than others? Hot spots of change if you will.
If a line hasn't been changed in a few years you can probably move it to another file without too much worry. I'd take a look at the source annotated with the last revision that touched a given line and see if there are any functions you could pull out.
Wow, sounds great. I think explaining to your boss, that you need a lot of time to refactor the beast is worth a try. If he doesn't agree, quitting is an option.
Anyway, what I suggest is basically throwing out all the implementation and regrouping it into new modules, let's call those "global services". The "main module" would only forward to those services and ANY new code you write will use them instead of the "main module". This should be feasible in a reasonable amount of time (because it's mostly copy and paste), you don't break existing code and you can do it one maintenance version at a time. And if you still have any time left, you can spend it refactoring all old depending modules to also use the global services.
Do not ever touch this file and the code again!
Treat it like something you are stuck with. Start writing adapters for the functionality encoded there.
Write new code in different units and talk only to adapters which encapsulate the functionality of the monster.
... if even one of the above is not possible, quit the job and get yourself a new one.
My sympathies - in my previous job I encountered a similar situation with a file that was several times larger than the one you have to deal with. Solution was:
1. Write code to exhaustively test the function in the program in question. Sounds like you won't already have this in hand...
2. Identify some code that can be abstracted out into a helper/utilities class. Need not be big, just something that is not truly part of your 'main' class.
3. Refactor the code identified in 2. into a separate class.
4. Rerun your tests to ensure nothing got broken.
5. When you have time, goto 2. and repeat as required to make the code manageable.
The classes you build in step 3. iterations will likely grow to absorb more code that is appropriate to their newly-clear function.
I could also add:
0: buy Michael Feathers' book on working with legacy code
Unfortunately this type of work is all too common, but my experience is that there is great value in being able to make working but horrid code incrementally less horrid while keeping it working.
Consider ways to rewrite the entire application in a more sensible way. Maybe rewrite a small section of it as a prototype to see if your idea is feasible.
If you've identified a workable solution, refactor the application accordingly.
If all attempts to produce a more rational architecture fail, then at least you know the solution is probably in redefining the program's functionality.
My 0.05 eurocents:
Re-design the whole mess, split it into subsystems taking into account the technical and business requirements (=many parallel maintenance tracks with potentially different codebase for each, there is obviously a need for high modifiability, etc.).
When splitting into subsystems, analyze the places which have most changed and separate those from the unchanging parts. This should show you the trouble-spots. Separate the most changing parts to their own modules (e.g. dll) in such a way that the module API can be kept intact and you don't need to break BC all the time. This way you can deploy different versions of the module for different maintenance branches, if needed, while having the core unchanged.
The redesign will likely need to be a separate project, trying to do it to a moving target will not work.
As for the source code history, my opinion: forget it for the new code. But keep the history somewhere so you can check it, if needed. I bet you won't need it that much after the beginning.
You most likely need to get management buy-in for this project. You can argue perhaps with faster development time, less bugs, easier maintaining and less overall chaos. Something along the lines of "Proactively enable the future-proofness and maintenance viability of our critical software assets" :)
This is how I'd start to tackle the problem at least.
Start by adding comments to it, with reference to where functions are called and whether you can move things around. This can get things moving. You really need to assess how brittle the code base is. Then move common bits of functionality together. Small changes at a time.
Another book you may find interesting/helpful is Refactoring.
Something I find useful to do (and I'm doing it now although not at the scale you face), is to extract methods as classes (method object refactoring). The methods that differ across your different versions will become different classes which can be injected into a common base to provide the different behaviour you need.
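Roughly what I mean, as a sketch. The shipping-cost example and all names in it (ShippingCost, FlatRate, PerKilo, MainClass::total) are invented; imagine the two compute functions started life as version-specific branches inside one long method.

```cpp
#include <cassert>

// The extracted method, turned into a small class hierarchy.
struct ShippingCost {
    virtual ~ShippingCost() = default;
    virtual int compute(int weightKg) const = 0;  // cost in cents
};

// Behaviour of one product version.
struct FlatRate : ShippingCost {
    int compute(int /*weightKg*/) const override { return 500; }
};

// Behaviour of another product version.
struct PerKilo : ShippingCost {
    int compute(int weightKg) const override { return 120 * weightKg; }
};

// The common base no longer contains version-specific code; the variant
// is injected at construction time.
class MainClass {
public:
    explicit MainClass(const ShippingCost& cost) : cost_(cost) {}

    int total(int priceCents, int weightKg) const {
        return priceCents + cost_.compute(weightKg);
    }

private:
    const ShippingCost& cost_;  // not owned; must outlive MainClass
};
```

Each maintenance version then only differs in which ShippingCost object it constructs, not in the body of MainClass.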
I found this sentence to be the most interesting part of your post:
> The file is used and actively changed in several (> 10) maintenance versions of our product and so it is really hard to refactor it
First, I would recommend that you use a source control system for developing these 10 + maintenance versions that supports branching.
Second, I would create ten branches (one for each of your maintenance versions).
I can feel you cringing already! But either your source control isn't working for your situation because of a lack of features, or it's not being used correctly.
Now to the branch you work on - refactor it as you see fit, safe in the knowledge that you'll not upset the other nine branches of your product.
I would be a bit concerned that you have so much in your main() function.
In any project I write, I use main() only to perform initialization of core objects - like a simulation or application object - these classes are where the real work should go on.
I would also initialize an application logging object in main for use globally throughout the program.
Finally, in main I also add leak detection code in preprocessor blocks that ensure it's only enabled in DEBUG builds. This is all I would add to main(). Main() should be short!
You say that
> The file basically contains the "main class" (main internal work dispatching and coordination) of our program
It sounds like these two tasks could be split into two separate objects - a co-ordinator and a work dispatcher.
When you split these up, you may mess up your "SCC workflow", but it sounds like adhering stringently to your SCC workflow is causing software maintenance problems. Ditch it, now and don't look back, because as soon as you fix it, you'll begin to sleep easy.
If you're not able to make the decision, fight tooth and nail with your manager for it - your application needs to be refactored - and badly by the sounds of it! Don't take no for an answer!
As you've described it, the main issue is diffing pre-split vs post-split, merging in bug fixes etc.. Tool around it. It won't take that long to hardcode a script in Perl, Ruby, etc. to rip out most of the noise from diffing pre-split against a concatenation of post-split. Do whatever's easiest in terms of handling noise:
remove certain lines pre/during concatenation (e.g. include guards)
remove other stuff from the diff output if necessary
You could even make it so whenever there's a checkin, the concatenation runs and you've got something prepared to diff against the single-file versions.
"The file basically contains the "main class" (main internal work dispatching and coordination) of our program, so every time a feature is added, it also affects this file and every time it grows."
If that big SWITCH (which I think there is) becomes the main maintenance problem, you could refactor it to use dictionary and the Command pattern and remove all switch logic from the existing code to the loader, which populates that map, i.e.:
// declaration
std::map<ID, ICommand*> dispatchTable;
...
// populating using some loader
dispatchTable[id] = concreteCommand;
...
// using
dispatchTable[id]->Execute();
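A slightly fuller, compilable sketch of the same idea. ID is assumed to be an int here, AppendCommand is an invented concrete command, and the table owns its commands through std::unique_ptr instead of raw pointers. Unlike the operator[] lookup above, unknown ids are reported instead of dereferencing a null pointer.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

using ID = int;  // assumption: commands are keyed by an integer id

struct ICommand {
    virtual ~ICommand() = default;
    virtual void Execute() = 0;
};

// Invented concrete command: appends a text to a log string.
class AppendCommand : public ICommand {
public:
    AppendCommand(std::string& log, std::string text)
        : log_(log), text_(std::move(text)) {}
    void Execute() override { log_ += text_; }

private:
    std::string& log_;
    std::string text_;
};

class Dispatcher {
public:
    // Called by the loader that populates the table.
    void add(ID id, std::unique_ptr<ICommand> command) {
        table_[id] = std::move(command);
    }

    // Replaces the big switch: look the id up and run its command.
    bool dispatch(ID id) {
        auto it = table_.find(id);
        if (it == table_.end()) {
            return false;  // unknown id, nothing executed
        }
        it->second->Execute();
        return true;
    }

private:
    std::map<ID, std::unique_ptr<ICommand>> table_;
};
```

Adding a feature then means registering one more command with the loader rather than editing the switch in the main file.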
I think the easiest way to track the history of source when splitting a file would be something like this:
Make copies of the original source code, using whatever history-preserving copy commands your SCM system provides. You'll probably need to submit at this point, but there's no need yet to tell your build system about the new files, so that should be ok.
Delete code from these copies. That should not break the history for the lines you keep.
I think what I would do in this situation is bite the bullet and:
Figure out how I wanted to split the file up (based on the current development version)
Put an administrative lock on the file ("Nobody touch mainmodule.cpp after 5pm Friday!!!")
Spend your long weekend applying that change to the >10 maintenance versions (from oldest to newest), up to and including the current version.
Delete mainmodule.cpp from all supported versions of the software. It's a new Age - there is no more mainmodule.cpp.
Convince Management that you shouldn't be supporting more than one maintenance version of the software (at least without a big $$$ support contract). If each of your customers have their own unique version.... yeeeeeshhhh. I'd be adding compiler directives rather than trying to maintain 10+ forks.
Tracking old changes to the file is simply solved by your first check-in comment saying something like "split from mainmodule.cpp". If you need to go back to something recent, most people will remember the change; if it's 2 years from now, the comment will tell them where to look. Of course, how valuable will it be to go back more than 2 years to look at who changed the code and why?

Updating a codebase to meet standards

If you've got a codebase which is a bit messy in respect to coding standards - a mix of different conventions from different people - is it reasonable to give one person the task of going through every file and bringing it up to meet standards?
As well as being tremendously dull, you're going to get a mass of changes in SVN (or whatever) which can make comparing versions harder. Is it sensible to set someone on the whole codebase, or is it considered stupid to touch a file only to make it meet standards? Should files be left alone until some 'real' change is needed, and then updated?
Tagged as C++ since I think different languages have different automated tools for this.
Should files be left alone until some 'real' change is needed, and then updated?
This is what I would do.
Even if it's primarily text layout changes, doing it by a manual process on a large scale risks breaking code that was working.
Treat it as a refactor and do it locally whenever code has to be touched for some other reason. Add tests if they're missing to improve your chances of not breaking the code.
If your code is already well covered by tests, you might get away with something global, but I still wouldn't advocate it.
I also think this is pretty much language-agnostic.
It also depends on what kind of changes you are planning to make in order to bring it up to your coding standard. Everyone's definition of coding standard is different.
More specifically:
Can your proposed changes be made to the project with 100% guarantee that the entire project will work identically the same as before? For example, changes that only affect comments, line breaks and whitespaces should be fine.
If you do not have 100% guarantee, then there is a risk that should not be taken unless it can be balanced with a benefit. For example, is there a need to gain a deeper understanding of the current code base in order to continue its development, or fix its bugs? Is the jumble of coding conventions preventing these initiatives? If so, evaluate the costs and benefits and decide whether a makeover is justified.
If you need to understand the current code base, here is a technique: tracing.
Make a copy of the code base. Note that tracing involves adding code, so it should not be performed on the production copy.
In the new copy, insert many fprintf (trace) statements into any functions considered critical. It may be possible to automate this.
Run the project with various inputs and collect those tracing results. This will help everyone understand the current project's design.
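A minimal sketch of such trace statements, assuming a hypothetical TRACE macro and an invented function applyGain. The macro writes to stderr so the trace does not mix with the program's normal output.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical trace macro for the tracing copy of the code base.
// Prints file and line, then a printf-style message, then a newline.
#define TRACE(...)                                                  \
    do {                                                            \
        std::fprintf(stderr, "[trace] %s:%d ", __FILE__, __LINE__); \
        std::fprintf(stderr, __VA_ARGS__);                          \
        std::fputc('\n', stderr);                                   \
    } while (0)

// Invented "critical" function with entry and exit traces added.
int applyGain(int input, int gain) {
    TRACE("applyGain enter input=%d gain=%d", input, gain);
    const int result = input * gain;
    TRACE("applyGain leave result=%d", result);
    return result;
}
```

Redirecting stderr to a file per run gives you the trace logs to compare across inputs.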
Another technique for understanding the current code base is to document the dependencies in the project.
Some kinds of dependencies (interface dependency, C++ include dependency, C++ typedef / identifier dependency) can be extracted by automated tools.
Run-time dependency can only be extracted through tracing, or by profiling tools.
I was thinking it's a task you might give a work-experience kid or put out onto RentaCoder
This depends mainly on the codebase's size.
I've seen three trainees given the task to go through a 2MLoC codebase (several thousand source files) in order to insert one new line into the standard disclaimer at the top of all the source files (with the line's content depending on the file's name and path). It took them several days. One of the three used most of that time to write a script that would do it and later only fixed the files where the script had failed to insert the line correctly, the other two just ploughed through the files. (The one who wrote the script later got a job at that company.)
The job of manually adapting all those files in that codebase to certain coding standards would probably have to be measured in man-years.
OTOH, if it's just a few dozen files, it's certainly doable.
Your codebase is very likely somewhere in between, so your best bet might be to set a "work-experience kid" to find out whether there's a tool that can do this to your satisfaction and, if so, make it work.
Should files be left alone until some 'real' change is needed, and then updated?
I'd strongly advise against this. If you do this, you will have "real" changes intermingled with whatever reformatting took place, making it nigh impossible to see the "real" changes in the diff.
You can address the formatting aspect of coding style fairly easily. There are a number of tools that can auto-format your code. I recommend hooking one of these up to your version control tool's "check in" feature. This way, people can use whatever format they want while editing their code, but when it gets checked in, it's reformatted to the official style.
In general, I think it's best if you can do the big change all at once. In the past, we've done the following:
1. have a time dedicated to the reformatting when most people aren't working (e.g. at night or on the weekend)
2. have a person check out as many files as possible at that time, reformat them, and check them in again
With a reformatting-only revision, you don't have to figure out what has changed in addition to the formatting.