Are there any official C++ recommendations concerning the amount of information that should be disclosed in a method name? I am asking because I can find plenty of references on the Internet, but none that really explains this.
I'm working on a C++ class with a method called calculateIBANAndBICAndSaveRecordChainIfChanged, which pretty well explains what the method does. A shorter name would be easier to remember and would need no intellisense or copy & paste to type. It would be less descriptive, true, but functionality is supposed to be documented.
calculateIBANAndBICAndSaveRecordChainIfChanged is considered a bad function name: it breaks the one-function-does-one-thing rule.
Reduce complexity
The single most important reason to create a routine is to reduce a program's complexity. Create a routine to hide information so that you won't need to think about it. Sure, you'll need to think about it when you write the routine. But after it's written, you should be able to forget the details and use the routine without any knowledge of its internal workings. Other reasons to create routines—minimizing code size, improving maintainability, and improving correctness—are also good reasons, but without the abstractive power of routines, complex programs would be impossible to manage intellectually.
You could simply break this function into the functions below (a sketch follows the list):
CalculateIBAN
CalculateBIC
SaveRecordChain
IsRecordChainChanged
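For illustration, here is a minimal sketch of how that split might look. The enclosing class, the body stubs, and the wrapper name UpdateBankingDetails are hypothetical; only the four helper names come from the list above:

    class AccountRecord {   // hypothetical owning class
    public:
        // The old calculateIBANAndBICAndSaveRecordChainIfChanged, decomposed:
        void UpdateBankingDetails() {
            CalculateIBAN();
            CalculateBIC();
            if ( IsRecordChainChanged() )
                SaveRecordChain();
        }

    private:
        void CalculateIBAN()              { /* derive the IBAN from account data */ }
        void CalculateBIC()               { /* derive the BIC from bank data */ }
        bool IsRecordChainChanged() const { return true; /* compare with stored state */ }
        void SaveRecordChain()            { /* persist the record chain */ }
    };

Each helper now does one thing and gets a simple verb-plus-object name, while the wrapper documents the sequencing.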
To name a procedure, use a strong verb followed by an object
A procedure with functional cohesion usually performs an operation on an object. The name should reflect what the procedure does, and an operation on an object implies a verb-plus-object name. PrintDocument(), CalcMonthlyRevenues(), CheckOrderInfo(), and RepaginateDocument() are samples of good procedure names.
Describe everything the routine does
In the routine's name, describe all the outputs and side effects. If a routine computes report totals and opens an output file, ComputeReportTotals() is not an adequate name for the routine. ComputeReportTotalsAndOpenOutputFile() is an adequate name but is too long and silly. If you have routines with side effects, you'll have many long, silly names. The cure is not to use less-descriptive routine names; the cure is to program so that you cause things to happen directly rather than with side effects.
Avoid meaningless, vague, or wishy-washy verbs
Some verbs are elastic, stretched to cover just about any meaning. Routine names like HandleCalculation(), PerformServices(), OutputUser(), ProcessInput(), and DealWithOutput() don't tell you what the routines do. At the most, these names tell you that the routines have something to do with calculations, services, users, input, and output. The exception would be when the verb "handle" was used in the specific technical sense of handling an event.
Most of the above points are taken from Code Complete, 2nd Edition. Other good books are Clean Code and The Clean Coder by Robert C. Martin.
To answer the direct question, I don't think function names need to be memorable. It's nice if they are, but like you say this stuff is supposed to be documented. I can look it up.
calculateIBANAndBICAndSaveRecordChainIfChanged is too long for my taste. Aside from the inconvenience of having to c/p or auto-complete to even use them, my fear with long function names is that I don't read them properly either, so names with similar "shapes" start to look confusingly similar to one another.
So I would advise looking for a shorter name. There must be some reason why these operations (calculating two things, and conditionally saving a record chain) have been grouped together. That reason isn't described in the question, it lies somewhere in the specification or the history of your project. You should identify that reason and look to it for a more succinct function name.
When naming a function you can also consider what reasons[*] the function might have to change in the future. Why are there two things (IBAN and BIC) that are calculated at the same time? What is the relationship between them? Can you identify the reason for doing both at once and then saving?
For example: they are the "acronyms" for this object, it's common to want to recalculate the acronyms all at once, and if you recalculate then naturally the changes need saving. Then call the function refreshAcronyms. Maybe there will be a third acronym in future.
For another example: what callers really want is to save the object if changed, and it's an additional chore that, to preserve the integrity of the stored data, I must always recalculate the IBAN and the BIC before saving. In that case, all the rest is a necessary precursor to saving, so I can call the function saveRecordChain. Users of the public interface just need to know that the save function does what needs to be done. There might be a serializeToFile() function in the private interface that saves if changed without doing the extra stuff.
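As a rough sketch of that second idea (everything beyond the names mentioned above is made up):

    class RecordChain {
    public:
        // Public save: recalculation is an internal integrity detail,
        // not something every caller has to remember.
        void saveRecordChain() {
            recalculateIBAN();
            recalculateBIC();
            serializeToFile();
        }

    private:
        void recalculateIBAN() { /* ... */ }
        void recalculateBIC()  { /* ... */ }
        bool changed() const   { return true; /* compare with stored copy */ }

        // Saves if changed, without the recalculation chores.
        void serializeToFile() {
            if ( changed() ) { /* write the record chain */ }
        }
    };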
[*] I say "reasons" plural, but Robert C Martin defines the "single responsibility principle" to be that there is only one possible reason to change a well-designed function.
Ideally, one method should do only one thing, and your method name should reflect what it does (that one thing); only then does your program become readable.
It's a matter of personal preference, although I would think that calculateIBANAndBICAndSaveRecordChainIfChanged is too long and therefore difficult to read and code with (unless you're using a smart editor that can auto-complete).
Two further points:
The function needs to be broken down into smaller parts, as other posters have suggested.
There's no law against commenting your headers to give a more detailed description of the function there, so you don't have to build every aspect of its functionality into the name.
You read and write too many methods over the course of your career to remember their names. Most programmers would need to look up the name of a function from their language's standard library, let alone the names of functions that they or their team developed! The most memorable function name would be of no use to someone maintaining your code and seeing the call for the first time. Moreover, chances are good that in six months you wouldn't remember it either!
That is why I recommend going for descriptive names first, and not worrying about the ease of memorization: after all, IDEs with intellisense are not going away any time soon (and they were introduced for a good reason - to address our memory limitations).
For personal use that would be enough, but in any case, after completing the app you have to refactor every function name to say exactly what it does. And if you are working in a group or in a company, make sure each function name reflects its functionality.
And for your example, I might name the function something like saveRecordWithRespectToIBANAndBIC().
Related
We are working on a medium to large C++ codebase, and are in the process of refactoring it to get it in better shape.
Recently it was suggested that we expand our naming convention for functions that (may) throw exceptions, to make it easier to determine - at a glance - if a function may throw exceptions (directly or indirectly by calling functions that throw).
While I find that the ability to get that information more easily would be neat, I cannot shake the feeling that this might lead to trouble, as there is no tool-assisted way to verify and enforce the convention - thus you cannot really rely on it (beyond it giving hints).
As I feel torn and uncertain about this I decided to seek advice from here:
So, is using such naming conventions a good idea/worth the effort and are there established conventions for that?
It really will not let you determine anything. It would be yet another idiosyncratic convention that's unenforceable at the programming stage: what happens if someone refactors a function and forgets to change its name?
If someone is compelled to change a function name due to changes in the exceptions that can emanate from that function then you could introduce breaking changes into your program due to overload resolution: messing about with function names can have unintended side-effects.
From C++11 you can use the noexcept specifier, which might be helpful.
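For example (a minimal illustration of the specifier itself, not of any naming convention; the function names are made up):

    #include <stdexcept>

    void mayThrow() { throw std::runtime_error("oops"); }  // may propagate exceptions
    void neverThrows() noexcept { }  // if an exception escapes, std::terminate is called

    int main() {
        // Unlike a naming convention, the guarantee is queryable at compile time:
        static_assert( !noexcept(mayThrow()),    "mayThrow() makes no promise" );
        static_assert(  noexcept(neverThrows()), "neverThrows() promises not to throw" );
    }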
In principle, it is a good idea to have a naming convention, but applying it in practice is a real pain. My experience with two huge code bases taught me that you can only start to apply such conventions (if they did not exist earlier) with new designs.
The rule of thumb we followed (for all functions, including the ones that do not throw) is to reflect the component name (that the function is part of) and the functionality (that it serves) in the method name, and then to put informative text in the exception message.
The exceptions thrown contain a message that tries to reflect the business/usage context if the software directly interacts with a human end user, or a more technical message if it's internal and goes to the logs. There is no silver bullet for naming conventions; it is mostly domain-driven.
It entirely depends on how mature the software department in the company is and what the domain is.
A convention can be adopted from C#. For example, for a dictionary you have Add and TryAdd.
In my example I am providing a DLL and exporting a function that MUST be named Start.
Also, since it is a DLL export, it SHOULD NOT throw.
But now I want to include this function in tests. Here I actually want it to throw, so I can see the stack in test reports.
Now despite all the arguments, the function name shall be informative. Here I use StartUnsafe.
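A minimal sketch of that split (the body of the work function is made up for illustration):

    #include <cstdio>
    #include <exception>

    // Throwing variant, used in tests, where a stack trace in the report is useful.
    void StartUnsafe() {
        // ... the real work; may throw ...
    }

    // Exported, non-throwing variant: a DLL boundary should not leak exceptions.
    extern "C" int Start() {
        try {
            StartUnsafe();
            return 0;
        } catch (const std::exception& e) {
            std::fprintf(stderr, "Start failed: %s\n", e.what());
            return -1;
        } catch (...) {
            return -1;
        }
    }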
I am thinking about a method to handle the data more efficiently. Let me explain it:
Currently there is a class called Rules. It has a lot of member functions, like Rules::isForwardEligible(), Rules::isCurrentNumberEligible(), and so on. These functions are used to check specific situations (when other processes call them), and all of them return a bool value.
In the body of these functions are ifs which query the DB to compare data, and finally return true or false.
So the whole flow is like if(Rules::isCurrentNumberEligible()) ---> check the content of Rules::isCurrentNumberEligible() ---> if(xxxx) (where xxxx is another function that again queries the DB). I think this way is not good, and I want to improve it.
What I am imagining is to use less code but fetch more of the information with each query.
So when I query in the first step, if(Rules::isCurrentNumberEligible()), I can set up different tables to query, so that things like if(xxx){if(xx){if(xx)....}} become less common. One solution is to build a class whose role is like a coordinator, and to ask it each time for the different queries. Is that suitable?
I am not sure this is a good way to handle it, or maybe there are better solutions out there. Please help me, thanks!
The classical algorithm for rule-based systems is the RETE algorithm. It strives to minimize the number of rules to be evaluated. The trick is that a re-evaluation of a rule does not make sense unless at least one related fact has changed.
In general, the rules which promise the maximum information gain should be queried first. This helps to pin down the respective case in as few questions as possible.
A physician in differential diagnosis would always order his/her questions from general to specific. In information theory this is called the principle of maximum entropy.
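A very rough sketch of the RETE trick in isolation, a cached boolean rule that is only re-evaluated after a related fact changes (all names are hypothetical):

    #include <functional>

    class CachedRule {
    public:
        explicit CachedRule(std::function<bool()> evaluate)
            : evaluate_(evaluate), valid_(false), value_(false) {}

        // Called by whatever code mutates a fact this rule depends on.
        void factChanged() { valid_ = false; }

        // Hits the DB (via evaluate_) only when the cached result is stale.
        bool check() {
            if (!valid_) { value_ = evaluate_(); valid_ = true; }
            return value_;
        }

    private:
        std::function<bool()> evaluate_;
        bool valid_;
        bool value_;
    };

A Rules member like isCurrentNumberEligible() could then delegate to such a cached rule, and the code that updates the DB calls factChanged().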
I am working on a motion control system, and will have at least 5 motors, each with parameters such as "gearbox ratio", "ticks per rev" "Kp", "Ki", "Kd", etc. that will be referenced upon construction of instances of the motors.
My question to StackOverflow is how should I organize these numbers? I know this is likely a preferential thing, but being new to coding I figure I could get some good opinions from you.
The three approaches I immediately see are as follows:
Write the values in the call to the constructor, either via variables or numbers -- PROS: limited coding, could be implemented in a way that's easy to change, but possibly harder than #defines
Use #defines to accomplish something similar to the above -- PROS: least coding, easy to change (assuming you want to look at the source code)
Load a file (possibly named "motorparameters.txt"), load the parameters into an array, and populate from that array. If I really wanted to, I could add a GUI approach to changing this file rather than a manual one. -- PROS: easiest to change without diving into source code.
These parameters could change over time, and while there are other coders at the company, I would like to leave it in a way that's easy to configure. Do any of you see a particular benefit of #define vs. variables? I have a "constants.h" file already that I could easily add the #defines to, or I could add variables near the call to the constructor.
There's a principle known as YAGNI (You Ain't Gonna Need It) which says do the simplest thing first, then change it when (if) your requirements expand.
Sounds to me like the thing to do is:
Write a flexible motor class, that can handle any values (within reason), even though there are only 5 different sets of values you currently care about.
Define a component that returns the "right" values for the 5 motors in your system (or that constructs the 5 motors for your system using the "right" values)
Initially implement that component to use some hard-coded values out of a header file
Retain the option to replace that component in future with an implementation of the same API, but that reads values out of a resource file, text file, XML file, GUI interaction with the user, off the internet, by making queries to the hardware to find out what motors it thinks it has, whatever.
I say this on the basis that you minimize expected effort by putting in a point of customizability where you suspect you'll want one (to prevent a lot of work when you change it later), but implement using the simplest thing that satisfies your current certain requirements.
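A sketch of what that component might look like; the parameter names come from the question, but the numbers, types, and function name are made up:

    // MotorParams.h -- the component's API. Today it is backed by hard-coded
    // constants; later it could read a file or query the hardware without
    // any caller changing.
    struct MotorParams {
        double gearboxRatio;
        int    ticksPerRev;
        double kp, ki, kd;
    };

    inline MotorParams getMotorParams(int motorIndex) {
        static const MotorParams kParams[5] = {
            { 26.9, 1024, 1.00, 0.10, 0.05 },  // motor 0 -- illustrative numbers
            { 26.9, 1024, 1.00, 0.10, 0.05 },  // motor 1
            { 26.9, 1024, 1.00, 0.10, 0.05 },  // motor 2
            { 13.5, 2048, 0.80, 0.05, 0.02 },  // motor 3
            { 13.5, 2048, 0.80, 0.05, 0.02 },  // motor 4
        };
        return kParams[motorIndex];
    }

A motor is then constructed with something like Motor m(getMotorParams(0)); swapping the hard-coded table for a file-reading implementation later touches only this one function.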
Some people might say that it's not actually worth doing the typing (a) to define the component, better just to construct 5 motors in main() (b) to use constants from a header file, better just to type numeric literals in main(). The (b) people are widely despised as peddlers of "magic constants" (which doesn't mean they're necessarily wrong about relative total programming time by implementer and future maintainers, they just probably are) and the (a) people divide opinion. I tend to figure that defining this kind of thing takes a few minutes, so I don't really care whether it's worth it or not. Loading the values out of a file involves specifying a file format that I might regret as soon as I encounter a real reason to customize, so personally I can't be bothered with that until the requirement arises.
The general idea is to separate the portions of your code that will change from those that won't. The more likely something is to change, the more you need to make it easy to change.
If you're building a commercial app where hundreds or thousands of users will use many different motors, it might make sense to code up a UI and store the data in a config file.
If this is development code and these parameters are unlikely to change, then stuffing them into #defines in your constants.h file is probably the way to go.
Number 3 is a great option if you don't have security or IP concerns. Anytime you or someone else touches your code, you introduce the possibility of regressions. By keeping your parameters in a text file, not only are you making life easier on yourself, you're also reducing the scope of possible errors down the road.
I face a situation where we have many very long methods, 1000 lines or more.
To give you some more detail: we have a list of incoming high level commands, and each one results in a longer (sometimes huge) list of lower level commands. There's a factory creating an instance of a class for each incoming command. Each class has a process method, where all the lower level commands are generated and added in sequence. As I said, these sequences of commands and their parameters quite often cause the process methods to reach thousands of lines.
There are a lot of repetitions. Many command patterns are shared between different commands, but the code is repeated over and over. That leads me to think refactoring would be a very good idea.
On the other hand, the specs we have come in exactly the same form as the current code: a very long list of commands for each incoming one. When I've tried some refactoring, I've started to feel uncomfortable with the specs; I lose the obvious correspondence between the specs and the code, and waste time digging into newly created common classes.
Then here is the question: in general, do you think such very long methods always need refactoring, or would it be acceptable in a similar case?
(unfortunately refactoring the specs is not an option)
edit:
I have removed every reference to "generate" because it was actually confusing. It's not auto-generated code.
class InCmd001 {
    OutMsg process( InMsg& inMsg ) {
        OutMsg outMsg = OutMsg::Create();

        OutCmd001 outCmd001 = OutCmd001::Create();
        outCmd001.SetA( param.getA() );   // 'param' is a member of the command class
        outCmd001.SetB( inMsg.getB() );
        outMsg.addCmd( outCmd001 );

        OutCmd016 outCmd016 = OutCmd016::Create();
        outCmd016.SetF( param.getF() );
        outMsg.addCmd( outCmd016 );

        OutCmd007 outCmd007 = OutCmd007::Create();
        outCmd007.SetR( inMsg.getR() );
        outMsg.addCmd( outCmd007 );

        // ...... hundreds more commands like these ......
        return outMsg;
    }
};
Here is an example of one incoming command class (manually written in pseudo-C++).
Code never needs refactoring. The code either works, or it doesn't. And if it works, the code doesn't need anything.
The need for refactoring comes from you, the programmer. The person reading, writing, maintaining and extending the code.
If you have trouble understanding the code, it needs to be refactored. If you would be more productive by cleaning up and refactoring the code, it needs to be refactored.
In general, I'd say it's a good idea for your own sake to refactor 1000+ line functions. But you're not doing it because the code needs it. You're doing it because that makes it easier for you to understand the code, test its correctness, and add new functionality.
On the other hand, if the code is automatically generated by another tool, you'll never need to read it or edit it. So what'd be the point in refactoring it?
I understand exactly where you're coming from, and can see exactly why you've structured your code the way it is, but it needs to change.
The uncertainty you feel when you attempt to refactor can be ameliorated by writing unit tests. If you have tests specific to each spec, then the code for each spec can be refactored until you're blue in the face, and you can have confidence in it.
A second option: is it possible to automatically generate your code from a data structure?
If you have a core suite of classes that do the donkey work and handle the edge cases, you can auto-generate the repetitive 1000-line methods as often as you wish (see the sketch below for a related, table-driven alternative).
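As a variation on that theme, here is a table-driven sketch that gets a similar effect without a separate code generator. It borrows the OutMsg/InMsg types from the question's pseudo-code, so it is pseudo-C++ too:

    #include <functional>
    #include <vector>

    // Each spec line becomes one entry in a table instead of one
    // hand-written block in a 1000-line method.
    using Step = std::function<void(OutMsg&, InMsg&)>;

    OutMsg process( InMsg& inMsg, const std::vector<Step>& steps ) {
        OutMsg outMsg = OutMsg::Create();
        for (const Step& step : steps)
            step(outMsg, inMsg);   // each step creates and adds its command(s)
        return outMsg;
    }

The table itself can then be laid out to mirror the spec line by line, which keeps the analogy between spec and code that the question is worried about losing.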
However, there are exceptions to every rule.
If the methods are a literal interpretation of the spec (very little additional logic), and the specs change infrequently, and the "common" portions (i.e. bits that happen to be the same right now) of the specs change at different times, and you're not going to be asked to get a 10x performance gain out of the code anytime soon, then (and only then) . . . you may be better off with what you have.
. . . but on the whole, refactor.
Yes, always. 1000 lines is at least 10x longer than any function should ever be, and I'm tempted to say 100x, except that when dealing with input parsing and validation it can become natural to write functions with 20 or so lines.
Edit: Just re-read your question and I'm not clear on one point - are you talking about machine generated code that no-one has to touch? In which case I would leave things as they are.
Refactoring is not the same as writing from scratch. While you should never write code like this, before you refactor it you need to consider the costs of refactoring in terms of time spent, the associated risks in terms of breaking code that already works, and the net benefits in terms of future time saved. Refactor only if the net benefits outweigh the associated costs and risks.
Sometimes wrapping and rewriting can be a safer and more cost effective solution, even if it appears expensive at first glance.
Long methods need refactoring if they are maintained (and thus need to be understood) by humans.
As a rule of thumb, code for humans first. I don't agree with the common idea that functions need to be short. I think what you need to aim at is that when a human reads your code, they grok it quickly.
To this effect it's a good idea to simplify things as much as possible--but not more than that. It's a good idea to delegate roughly one task to each function. There is no rule as to what "roughly one task" means: you'll have to use your own judgement for that. But do recognize that a function split into too many other functions itself reduces readability. Think about the human being who reads your function for the first time: they would have to follow one function call after another, constantly context-switching and maintaining a stack in their mind. This is a task for machines, not for humans.
Find the balance.
Here you see how important naming things is. You will see it is not that easy to choose names for variables and functions; it takes time, but on the other hand it can save a lot of confusion on the human reader's side. Again, find the balance between saving your time and the time of the friendly humans who will follow you.
As for repetition, it's a bad idea. It's something that needs to be fixed, just like a memory leak. It's a ticking bomb.
As others have said before me, changing code can be expensive. You need to do the thinking as to whether it will pay off to spend all this time and effort, facing the risks of change, for better code. You will possibly lose lots of time and give yourself one headache after another now, in order to possibly save lots of time and headache later.
Take a look at the related question How many lines of code is too many?. There are quite a few tidbits of wisdom throughout the answers there.
To repost a quote (although I'll attempt to comment on it a little more here)... A while back, I read this passage from Ovid's journal:
I recently wrote some code for Class::Sniff which would detect "long methods" and report them as a code smell. I even wrote a blog post about how I did this (quelle surprise, eh?). That's when Ben Tilly asked an embarrassingly obvious question: how do I know that long methods are a code smell?
I threw out the usual justifications, but he wouldn't let up. He wanted information and he cited the excellent book Code Complete as a counter-argument. I got down my copy of this book and started reading "How Long Should A Routine Be" (page 175, second edition). The author, Steve McConnell, argues that routines should not be longer than 200 lines. Holy crud! That's waaaaaay too long. If a routine is longer than about 20 or 30 lines, I reckon it's time to break it up.
Regrettably, McConnell has the cheek to cite six separate studies, all of which found that longer routines were not only not correlated with a greater defect rate, but were also often cheaper to develop and easier to comprehend. As a result, the latest version of Class::Sniff on github now documents that longer routines may not be a code smell after all. Ben was right. I was wrong.
(The rest of the post, on TDD, is worth reading as well.)
Coming from the "shorter methods are better" camp, this gave me a lot to think about.
Previously my large methods were generally limited to "I need inlining here, and the compiler is being uncooperative", or "for one reason or another the giant switch block really does run faster than the dispatch table", or "this stuff is only called exactly in sequence and I really really don't want function call overhead here". All relatively rare cases.
In your situation, though, I'd have a large bias toward not touching things: refactoring carries some inherent risk, and it may currently outweigh the reward. (Disclaimer: I'm slightly paranoid; I'm usually the guy who ends up fixing the crashes.)
Consider spending your efforts on tests, asserts, or documentation that can strengthen the existing code and tilt the risk/reward scale before any attempt to refactor: invariant checks, bound function analysis, and pre/postcondition tests; any other useful concepts from DBC; maybe even a parallel implementation in another language (maybe something message oriented like Erlang would give you a better perspective, given your code sample) or even some sort of formal logical representation of the spec you're trying to follow if you have some time to burn.
Any of these kinds of efforts generally have a few results, even if you don't get to refactor the code: you learn something, you increase your (and your organization's) understanding of and ability to use the code and specifications, you might find a few holes that really do need to be filled now, and you become more confident in your ability to make a change with less chance of disastrous consequences.
As you gain a better understanding of the problem domain, you may find that there are different ways to refactor that you hadn't thought of previously.
This isn't to say "thou shalt have a full-coverage test suite, and DBC asserts, and a formal logical spec". It's just that you are in a typically imperfect situation, and diversifying a bit -- looking for novel ways to approach the problems you find (maintainability? fuzzy spec? ease of learning the system?) -- may give you a small bit of forward progress and some increased confidence, after which you can take larger steps.
So think less from the "too many lines is a problem" perspective and more from the "this might be a code smell, what problems is it going to cause for us, and is there anything easy and/or rewarding we can do about it?"
Leaving it cooking on the backburner for a bit -- coming back and revisiting it as time and coincidence allows (e.g. "I'm working near the code today, maybe I'll wander over and see if I can't document the assumptions a bit better...") may produce good results. Then again, getting royally ticked off and deciding something must be done about the situation is also effective.
Have I managed to be wishy-washy enough here? My point, I think, is that the code smells, the patterns/antipatterns, the best practices, etc -- they're there to serve you. Experiment to get used to them, and then take what makes sense for your current situation, and leave the rest.
I think you first need to "refactor" the specs. If there are repetitions in the spec, it too will become easier to read if it makes use of some "basic building blocks".
Edit: As long as you cannot refactor the specs, I wouldn't change the code.
Coding style guides are all made for easier code maintenance, but in your special case the ease of maintenance is achieved by following the spec.
Some people here asked if the code is generated. In my opinion it does not matter: If the code follows the spec "line by line" it makes no difference if the code is generated or hand-written.
1000 lines of code is nothing. We have functions that are 6 to 12 thousand lines long. Of course those functions are so big that things literally get lost in there, and no tool can help us even look at high level abstractions of them. The code is now, unfortunately, incomprehensible.
My opinion of functions that are that big is that they were not written by brilliant programmers but by incompetent hacks who shouldn't be left anywhere near a computer - but should be fired and left flipping burgers at McDonald's. Such code wreaks havoc by leaving behind features that cannot be added to or improved upon (too bad for the customer). The code is so brittle that it cannot be modified by anyone - even the original authors.
And yes, those methods should be refactored, or thrown away.
Do you ever have to read or maintain the generated code?
If yes, then I'd think some refactoring might be in order.
If no, then the higher-level language is really the language you're working with -- the C++ is just an intermediate representation on the way to the compiler -- and refactoring might not be necessary.
Looks to me that you've implemented a separate language within your application - have you considered going that way?
It has been my understanding that it's recommended that any method over 100 lines of code be refactored.
I think some rules may be a little different in this era, when code is most commonly viewed in an IDE. If the code does not contain exploitable repetition, such that there are 1,000 lines which are going to be referenced once each, and which share a significant number of variables in a clear fashion, dividing the code into 100-line routines, each of which is called once, may not be that much of an improvement over having a well-formatted 1,000-line module which includes #region tags or the equivalent to allow outline-style viewing.
My philosophy is that certain layouts of code generally imply certain things. To my mind, when a piece of code is placed into its own routine, that suggests that the code will be usable in more than one context (exception: callback handlers and the like in languages which don't support anonymous methods). If code segment #1 leaves an object in an obscure state which is only usable by code segment #2, and code segment #2 is only usable on a data object which is left in the state created by #1, then absent some compelling reason to put the segments in different routines, they should appear in the same routine. If a program puts objects through a chain of obscure states extending for many hundreds of lines of code, it might be good to rework the design of the code to subdivide the operation into smaller pieces which have more "natural" pre- and post- conditions, but absent some compelling reason to do so, I would not favor splitting up the code without changing the design.
For further reading, I highly recommend the long, insightful, entertaining, and sometimes bitter discussion of this topic over on the Portland Pattern Repository.
I've seen cases where it is not the case (for example, creating an Excel spreadsheet in .Net often requires a lot of lines of code for the formatting of the sheet), but most of the time the best thing would indeed be to refactor it.
I personally try to make a function small enough so it all appears on my screen (without affecting the readability of course).
1000 lines? They definitely need to be refactored. Also note that, for example, the default maximum number of executable statements is 30 in Checkstyle, a well-known coding standard checker.
If you refactor, when you refactor, add some comments to explain what the heck it's doing.
If it had comments, it would be much less likely a candidate for refactoring, because it would already be easier to read and follow for someone starting from scratch.
Then here is the question: in general, do you think such very long methods always need refactoring,
If you ask in general, we will say yes.
or would it be acceptable in a similar case? (unfortunately refactoring the specs is not an option)
Sometimes it is acceptable, but that is very unusual. I will give you a couple of examples:
There are some 8-bit microcontrollers, the Microchip PICs, that have only a fixed 8-level stack, so you can't nest more than 8 calls; care must be taken to avoid stack overflow, so in this special case having many small (nested) functions is not the best way to go.
Another example is when optimizing code (at a very low level), where you have to take into account the cost of jumps and context saving. Use this with care.
EDIT:
Even with generated code, you could need to refactor the way it's generated - for example to save memory or energy, to make the output human-readable, for beauty, who knows, etc.
There has been very good general advice; here is a practical recommendation for your sample:
Common patterns can be isolated in plain feeder methods:
    void AddSimpleTransform( OutMsg & msg, InMsg const & inMsg,
                             int rotateBy, int foldBy, int gonkBy = 0 )
    {
        // create & add up to three messages
    }
You might even improve that by making this a member of OutMsg, and using a fluent interface, such that you can write
    OutMsg msg;
    msg.AddSimpleTransform( inMsg, 12, 17 )
       .Staple( "print" )
       .AddArtificialRust( 0.02 );
which can be an additional improvement under some circumstances.
I have been working on some 10 year old C code at my job this week, and after implementing a few changes, I went to the boss and asked if he needed anything else done. That's when he dropped the bomb. My next task was to go through the 7000 or so lines and understand more of the code, and to modularize the code somewhat. I asked him how he would like the source code modularized, and he said to start putting the old C code into C++ classes.
Being a good worker, I nodded my head yes, and went back to my desk, where I sit now, wondering how in the world to take this code, and "modularize" it. It's already in 20 source files, each with its own purpose and function. In addition, there are three "main" structs. each of these structures has 30 plus fields, many of them being other, smaller structs. It's a complete mess to try to understand, but almost every single function in the program is passed a pointer to one of the structs and uses the struct heavily.
Is there any clean way for me to shoehorn this into classes? I am resolved to do it if it can be done, I just have no idea how to begin.
First, you are fortunate to have a boss who recognizes that code refactoring can be a long-term cost-saving strategy.
I've done this many times, that is, converting old C code to C++. The benefits may surprise you. The final code may be half the original size when you're done, and much simpler to read. Plus, you will likely uncover tricky C bugs along the way. Here are the steps I would take in your case. Small steps are important because you can't jump from A to Z when refactoring a large body of code. You have to go through small, intermediate steps which may never be deployed, but which can be validated and tagged in whatever RCS you are using.
Create a regression/test suite. You will run the test suite each time you complete a batch of changes to the code. You should have this already, and it will be useful for more than just this refactoring task. Take the time to make it comprehensive. The exercise of creating the test suite will get you familiar with the code.
Branch the project in your revision control system of choice. Armed with a test suite and playground branch, you will be empowered to make large modifications to the code. You won't be afraid to break some eggs.
Make those struct fields private. This step requires very few code changes, but can have a big payoff. Proceed one field at a time. Try to make each field private (yes, or protected), then isolate the code which accesses that field. The simplest, most non-intrusive conversion would be to make that code a friend function. Consider also making that code a method. Converting the code to a method is simple, but you will have to convert all of the call sites as well. One is not necessarily better than the other.
Narrow the parameters to each function. It's unlikely that any function requires access to all 30 fields of the struct passed as its argument. Instead of passing the entire struct, pass only the components needed. If a function does in fact seem to require access to many different fields of the struct, then this may be a good candidate to be converted to an instance method.
Const-ify as many variables, parameters, and methods as possible. A lot of old C code fails to use const liberally. Sweeping through from the bottom up (bottom of the call graph, that is), you will add stronger guarantees to the code, and you will be able to identify the mutators from the non-mutators.
Replace pointers with references where sensible. The purpose of this step has nothing to do with being more C++-like just for the sake of being more C++-like. The purpose is to identify parameters that are never NULL and which can never be re-assigned. Think of a reference as a compile-time assertion which says, this is an alias to a valid object and represents the same object throughout the current scope.
Replace char* with std::string. This step should be obvious. You might dramatically reduce the lines of code. Plus, it's fun to replace 10 lines of code with a single line. Sometimes you can eliminate entire functions whose purpose was to perform C string operations that are standard in C++.
Convert C arrays to std::vector or std::array. Again, this step should be obvious. This conversion is much simpler than the conversion from char* to std::string because the interfaces of std::vector and std::array are designed to match the C array syntax. One of the benefits is that you can eliminate the extra length variable passed to every function alongside the array.
Convert malloc/free to new/delete. The main purpose of this step is to prepare for future refactoring. Merely changing C code from malloc to new doesn't directly gain you much. This conversion allows you to add constructors and destructors to those structs, and to use built-in C++ automatic memory tools.
Replace localized new/delete operations with the std::auto_ptr family. The purpose of this step is to make your code exception-safe.
Throw exceptions wherever return codes are handled by bubbling them up. If the C code handles errors by checking for special error codes then returning the error code to its caller, and so on, bubbling the error code up the call chain, then that C code is probably a candidate for using exceptions instead. This conversion is actually trivial. Simply throw the return code (C++ allows you to throw any type you want) at the lowest level. Insert a try{} catch(){} statement at the place in the code which handles the error. If no suitable place exists to handle the error, consider wrapping the body of main() in a try{} catch(){} statement and logging it.
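To make a few of these steps concrete, here is a small, hypothetical before/after sketch (not taken from the poster's code):

    // Before (C): a raw struct, a nullable pointer parameter, malloc'd
    // strings, and an error code bubbled up to the caller:
    //
    //     struct Widget { char* name; /* ... */ };
    //     int widget_process(struct Widget* w);  /* returns -1 on error */

    #include <stdexcept>
    #include <string>

    class Widget {
    public:
        explicit Widget(const std::string& name) : name_(name) {}

        // const method instead of a mutating free function; an exception
        // replaces the bubbled-up return code.
        void process() const {
            if (name_.empty())
                throw std::runtime_error("widget has no name");
            // ... real work ...
        }

    private:
        std::string name_;   // private field; was a malloc/free'd char*
    };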
Now step back and look how much you've improved the code, without converting anything to classes. (Yes, yes, technically, your structs are classes already.) But you haven't scratched the surface of OO, yet managed to greatly simplify and solidify the original C code.
Should you convert the code to use classes, with polymorphism and an inheritance graph? I say no. The C code probably does not have an overall design which lends itself to an OO model. Notice that the goal of each step above has nothing to do with injecting OO principles into your C code. The goal was to improve the existing code by enforcing as many compile-time constraints as possible, and by eliminating or simplifying the code.
One final step.
Consider adding benchmarks so you can show them to your boss when you're done. Not just performance benchmarks. Compare lines of code, memory usage, number of functions, etc.
Really, 7000 lines of code is not very much. For such a small amount of code a complete rewrite may be in order. But how is this code going to be called? Presumably the callers expect a C API? Or is this not a library?
Anyway, rewrite or not, before you start, make sure you have a suite of tests which you can run easily, with no human intervention, on the existing code. Then with every change you make, run the tests on the new code.
This shoehorning into C++ seems to be arbitrary. Ask your boss why he needs it done, figure out if you can meet the same goal less painfully, see if you can prototype a subset in the new, less painful way, then go and demo to your boss and recommend that you follow the less painful way.
First, tell your boss you're not continuing until you have:
http://www.amazon.com/Refactoring-Improving-Design-Existing-Code/dp/0201485672
and to a lesser extent:
http://www.amazon.com/Working-Effectively-Legacy-Michael-Feathers/dp/0131177052
Secondly, there is no way of modularising code by shoe-horning it into C++ classes. This is a huge task, and you need to communicate the complexity of refactoring highly procedural code to your boss.
It boils down to making a small change (extract method, move method to class, etc...) and then testing - there are no shortcuts with this.
I do feel your pain though...
I guess that the thinking here is that increasing modularity will isolate pieces of code, such that future changes are facilitated. We have confidence in changing one piece because we know it cannot affect other pieces.
I see two nightmare scenarios:
You have nicely structured C code that will easily transform into C++ classes. In that case it probably already is pretty darn modular, and you've probably done nothing useful.
It's a rat's nest of interconnected stuff. In that case it's going to be really tough to disentangle it. Increasing modularity would be good, but it's going to be a long hard slog.
However, maybe there's a happy medium. Could there be pieces of logic that are important and conceptually isolated, but which are currently brittle because of a lack of data hiding etc.? (Yes, good C doesn't suffer from this, but we don't have that; otherwise we would leave well enough alone.)
Pulling out a class to own that logic and its data, encapsulating that piece, could be useful. Whether it's better to do it with C or C++ is open to question. (The cynic in me says: "I'm a C programmer; great, C++ is a chance to learn something new!")
So: I'd treat this as an elephant to be eaten. First, decide if it should be eaten at all; bad elephant is just no fun, and well-structured C should be left alone. Second, find a suitable first bite. And I'd echo Neil's comments: if you don't have a good automated test suite, you are doomed.
I think a better approach could be to totally rewrite the code, but you should ask your boss for what purpose he wants you "to start putting the old C code into C++ classes".
You should ask for more details
Surely it can be done - the question is at what cost? It is a huge task, even for 7K LOC. Your boss must understand that it's gonna take a lot of time, while you can't work on shiny new features etc. If he doesn't fully understand this, and/or is not willing to support you, there is no point starting.
As @David already suggested, the Refactoring book is a must.
From your description it sounds like a large part of the code is already "class methods", where the function gets a pointer to a struct instance and works on that instance. So it could be fairly easily converted into C++ code. Granted, this won't make the code much easier to understand or better modularized, but if this is your boss' prime desire, it can be done.
Note also, that this part of the refactoring is a fairly simple, mechanical process, so it could be done fairly safely without unit tests (with hyperaware editing of course). But for anything more you need unit tests to make sure your changes don't break anything.
It's very unlikely that anything will be gained by this exercise. Good C code is already more modular than C++ typically can be - the use of pointers to structs allows compilation units to be independent in the same way as pImpl does in C++; in C you don't have to expose the data inside a struct to expose its interface. So if you turn each C function
// Foo.h
typedef struct Foo_s Foo;
int foo_wizz (const Foo* foo, ... );
into a C++ class with
// Foo.hxx
class Foo {
    // struct Foo members copied from Foo.c
    int wizz( ... ) const;
};
you will have reduced the modularity of the system compared with the C code - every client of Foo now needs rebuilding if any private implementation functions or member variables are added to the Foo type.
There are many things classes in C++ do give you, but modularity is not one of them.
Ask your boss what the business goals are being achieved by this exercise.
Note on terminology:
A module in a system is a component with a well-defined interface which can be replaced with another module with the same interface without affecting the rest of the system. A system composed of such modules is modular.
For both languages, the interface to a module is by convention a header file. Consider string.h and string as defining the interfaces to simple string processing modules in C and C++. If there is a bug in the implementation of string.h, a new libc.so is installed. This new module has the same interface, and anything dynamically linked to it immediately gets the benefit of the new implementation. Conversely, if there is a bug in string handling in std::string, then every project which uses it needs to be rebuilt. C++ introduces a very large amount of coupling into systems, which the language does nothing to mitigate - in fact, the better uses of C++ which fully exploit its features are often a lot more tightly coupled than the equivalent C code.
If you try and make C++ modular, you typically end up with something like COM, where every object has to have both an interface (a pure virtual base class) and an implementation, and you substitute an indirection for efficient template generated code.
If you don't care about whether your system is composed of replaceable modules, then you don't need to perform actions to make it modular, and you can use some of the features of C++, such as classes and templates, which, suitably applied, can improve cohesion within a module. If your project is to produce a single, statically linked application then you don't have a modular system, and you can afford not to care at all about modularity. If you want to create something like Anti-Grain Geometry, which is a beautiful example of using templates to couple together different algorithms and data structures, then you need to do that in C++ - pretty well nothing else widespread is as powerful.
So be very careful what your manager means by 'modularise'.
If every file already has "its own purpose and function" and "every single function in the program is passed a pointer to one of the structs" then the only difference made in changing it into classes would be to replace the pointer to the struct with the implicit this pointer. That would have no effect on how modularised the system is, in fact (if the struct is only defined in the C file rather than in the header) it will reduce modularity.
With “just” 7000 lines of C code, it will probably be easier to rewrite the code from scratch, without even trying to understand the current code.
And there is no automated way to do or even assist the modularization and refactoring that you envisage.
7000 LOC may sound like a lot, but much of this will be boilerplate.
Try and see if you can simplify the code before changing it to C++. Basically, though, I think he just wants you to convert functions into class methods and convert structs into class data members (if they don't contain function pointers; if they do, then convert those into actual methods).
Can you get in touch with the original coder(s) of this program? They could help you gain some understanding, but mainly I would be searching for the piece of code that is the "engine" of the whole thing and base the new software on that. Also, my boss told me that sometimes it is better to simply rewrite the whole thing, using the existing program as a very good reference whose run-time behavior you can mimic. Of course, specialized algorithms are hard to recode.
One thing I can assure you of is that if this code is not the best it could be, then you are going to have a lot of problems later on. I would go up to your boss and promote the fact that you need to redo parts of the program from scratch. I have just been there, and I am really happy my supervisor gave me the ability to rewrite. Now the 2.0 version is light years ahead of the original version.
I read this article, titled "Make bad code good", from http://www.javaworld.com/javaworld/jw-03-2001/jw-0323-badcode.html?page=7 . It's directed at Java users, but all of its ideas are pretty applicable to your case, I think. Though the title makes it sound like it is only for bad code, I think the article is for maintenance engineers in general.
To summarize Dr. Farrell's ideas, he says:
Start with the easy things.
Fix the comments
Fix the formatting
Follow project conventions
Write automated tests
Break up big files/functions
Rewrite code you don't understand
I think after following everyone else's advice this might be a good article to read when you have some free time.
Good luck!