C++: Using '.' operator on expressions and function calls

I was wondering if it is good practice to use the member operator . like this:
someVector = (segment.getFirst() - segment.getSecond()).normalize().normalCCW();
I just made that up to show the two different things I was wondering about, namely whether using (expression).member/function() and foo.getBar().getmoreBar() are in keeping with the spirit of readability and maintainability. In all the C++ code and books I've learned from, I've never seen it used this way, yet it's intoxicatingly easy to write. I don't want to develop any bad habits, though.
Probably more (or less) important than that, I was also wondering whether there would be any performance gains or losses from using it in this fashion, or unforeseen pitfalls that would introduce bugs into the program.
Thank you in advance!

or unforeseen pitfalls that would introduce bugs into the program
Well, the possible pitfalls would be
Harder to debug. You won't be able to view the results of each function call, so if one of them is returning something unexpected you will need to break it up into smaller segments to see what is going on. Also, any call in the chain may fail completely, so again, you may have to break it up to find out which call is failing.
Harder to read (sometimes). Chaining function calls can make the code harder to read. It depends on the situation, there's no hard and fast rule here. If the expression is even somewhat complex it can make things hard to follow. I don't have any problem reading your specific example.
It ultimately comes down to personal preference. I don't strive to fit as much as possible on a single line, and I have been bitten enough times by chaining where I shouldn't that I tend to break things up a bit. However, for simple expressions which are not likely to fail, chaining is fine.

Yes, this is perfectly acceptable and in fact would be completely unreadable in a lot of contexts if you were to NOT do this.
It's called method chaining.
There MIGHT be some performance gain in that you're not creating temporary variables. But any competent compiler will optimise it anyway.
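To make "method chaining" concrete, here is a minimal sketch; the Vec2 type is hypothetical and just mirrors the question's normalize()/normalCCW() names. Each method returns a value that the next call can be applied to:

#include <cmath>
#include <iostream>

struct Vec2 {
    double x = 0, y = 0;

    Vec2 operator-( const Vec2& o ) const { return { x - o.x, y - o.y }; }

    // Returns a normalized copy, so calls can chain on a temporary.
    Vec2 normalize() const {
        double len = std::sqrt( x * x + y * y );
        return { x / len, y / len };
    }

    // Returns the counter-clockwise normal.
    Vec2 normalCCW() const { return { -y, x }; }
};

int main() {
    Vec2 a{ 3, 4 }, b{ 0, 0 };
    Vec2 v = ( a - b ).normalize().normalCCW();  // chained calls on a temporary
    std::cout << v.x << ", " << v.y << '\n';     // prints -0.8, 0.6
}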

It is perfectly valid to use it the way you showed. It is used in the named parameter idiom described in the C++ FAQ Lite, for example.
One reason it is not always used is that you sometimes have to store an intermediate result, either for performance (if normalize is costly and you need its result more than once, it is better to store it in a variable) or for readability.
my2c

Using a variable to hold intermediate results can sometimes enhance readability, especially if you use good variable names. Excessive chaining can make it hard to understand what is happening. You have to use your judgement to decide if it's worthwhile to break down chains using variables. The example you present above is not excessive to me. Performance shouldn't differ much one way or the other if you enable optimization.

someVector = (segment.getFirst() - segment.getSecond()).normalize().normalCCW();
Not an answer to your question, but I should tell you that
the order in which the two operands of (segment.getFirst() - segment.getSecond()) are evaluated is unspecified by the C++ Standard. That only matters if the two calls have side effects that interact, but it is worth knowing.
Also, see this related topic : Is this code well-defined?
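A small sketch of why the order can matter - but only when the calls have side effects (the counter here is contrived for illustration):

#include <iostream>

int counter = 0;
int next() { return ++counter; }  // a getter with a side effect

int main() {
    // The Standard does not say which call is evaluated first, so this
    // may print -1 (1 - 2) or 1 (2 - 1) depending on the compiler. With
    // pure getters like getFirst()/getSecond(), either order is fine.
    std::cout << ( next() - next() ) << '\n';
}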

I suppose what you are doing is less readable; on the other hand, too many temporary variables can also become unreadable.
As far as performance goes, there is perhaps a little overhead in creating temporary variables, but the compiler can optimize that out.

There's no big problem with using it in this way - some APIs benefit greatly from method chaining. Plus, it's misleading to create a variable and then only use it once. When someone reads the next line, they don't have to think about variables you never created.

It depends on what you're doing.
For readability you should try to use intermediate variables:
assign calculation results to well-named variables, and then use them.

Related

Is it really better to have an unnecessary function call instead of using else?

So I had a discussion with a colleague today. He strongly suggested that I change a piece of code from
if (condition) {
    function->setValue(true);
} else {
    function->setValue(false);
}
to
function->setValue(false);
if (condition) {
    function->setValue(true);
}
in order to avoid the 'else'. I disagreed because - while it might improve readability to some degree - when the condition is true, we make one completely unnecessary function call.
What do you guys think?
Meh.
To do this just to avoid the else is silly (at the least, there should be a deeper rationale). There's typically no extra branching cost to it, especially after the optimizer goes through it.
Code compactness can sometimes be a desirable aesthetic, especially when more time is spent skimming and searching through code than reading it line by line. There can be legitimate reasons to favor terser code, but there are always pros and cons. Even then, compactness should not be about cramming logic into fewer lines so much as keeping the logic straightforward.
Correctness here might be easier to achieve with one or the other. The point was made in a comment that you might not know the side effects associated with calling setValue(false), though I would suggest that's kind of moot. Functions should have minimal side effects, they should all be documented at the interface/usage level if they aren't totally obvious, and if we don't know exactly what they are, we should be spending more time looking up their documentation prior to calling them (and their side effects should not be changing once firm dependencies are established to them).
Given that, it may sometimes be easier to achieve correctness and maintain it with a solution that starts out initializing states to some default value, and using a form of code that opts in to overwrite it in specific branches of code. From that standpoint, what your colleague suggested may be valid as a way to avoid tripping over that code in the future. Then again, for a simple if/else pair of branches, it's hardly a big deal.
Don't worry about the cost of the extra most-likely-constant-time function call either way in this kind of micro-level implementation case, especially with no tight performance-critical loop around this code (and even then, prefer to worry about it in hindsight, after profiling).
I think there are far better things to think about than this kind of coding style, like testing procedure. Reliable code tends to need less revisiting, and has the freedom to be written in a wider variety of ways without causing disputes. Testing is what establishes reliability. The biggest disputes about coding style tend to follow teams where there's more toe-stepping and more debugging of the same bodies of code over and over and over from disparate people due to lack of reliability, modularity, excessive coupling, etc. It's a symptom of a problem but not necessarily the root cause.
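As an aside that neither the question nor the answer raises: when the two branches differ only in the boolean being passed, the condition can simply be forwarded, which makes the debate moot for this particular example:

function->setValue( condition );  // one call, no branch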

Does C++ code run faster if there is no structure in the program

I know it helps a lot if we structure our programs using classes, structs, etc., but does it help in terms of running speed to avoid these structures and write the code plainly, in terms of basic C++ syntax?
For example, I am trying to write a program that works on vectors. Now it sounds tempting to write a vector class and define its methods, like set_at_index(int i), which sets the value at row i of the vector. Furthermore, I can check whether i <= N, where N is the length of the vector in question.
My concern is that with these routines, every call to a heavily used set_at_index method will require one 'if' statement. So if I want my code to run faster, should I avoid this and instead declare a plain array, manually taking care that there are no memory leaks?
Is there any way I can check for memory leaks without putting a burden on the code's speed?
Yes, bounds checking will take slightly more time. But it will take so little extra time that it will only matter if the code is run 28894389375 times, and even then it might add up to a millisecond. Note that std::vector only performs bounds checking if you use the at member function, not if you use operator[]. Also, if you are doing anything like writing to a file or printing text to the console, doing that once will likely take more time than ten million bounds-checked array accesses, because I/O is very slow by comparison.
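To make the at/operator[] distinction concrete, a minimal sketch:

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> v{ 1, 2, 3 };

    int a = v[1];     // operator[]: no bounds check; an invalid index is UB
    int b = v.at(1);  // at(): bounds-checked, throws on a bad index

    try {
        v.at( 99 );   // out of range: throws instead of corrupting memory
    } catch ( const std::out_of_range& e ) {
        std::cout << "bad index: " << e.what() << '\n';
    }
    (void)a; (void)b;
}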
Typically, without bounds checking code using classes will run at the same speed as code using plain arrays. The problem with manually managing memory like you suggest is that it's easy to forget to clean it up, or to clean it up only through one path of execution through the program, or to fail to clean it up in the event of an exception. It's really hardly ever worth it. Also, it'll be just as fast to use a vector class without bounds checking as it will be to use a dynamic array without bounds checking. You pay for it either way.
I also suggest using std::vector instead of writing your own vector class, since the standard library does pretty much every optimisation you could do yourself, and its implementers have the advantage of writing for their specific compiler, exploiting things only that compiler does because they know its internals. The STL classes are also rigorously tested and (usually) written by experts.
You should write your code first, then measure with a profiler to see the bottlenecks in your code if it is not fast enough already, then optimise the bottlenecks. I will bet that bounds checking on arrays is probably not going to be one of those bottlenecks.
Checking for memory leaks can be done with a tool like valgrind. You don't do it in the code itself.
Don't try to over-optimize before you even start writing. Go ahead and write code that is easily maintainable, readable, and as bug-free as possible. Once you have things working, you can start profiling to see the real bottlenecks.
"Premature optimization is the root of all evil" - Donald Knuth. (This is true 97% of the time.)
Unless you profile your application and see that your class encapsulation is a bottleneck that slows it down by a significant amount, don't hesitate to use high-level structures. They bring you plenty of benefits: readability, maintainability, and understanding what you do. That's what OOP is for: large-scale programs.
Some good answers have already been posted, and premature optimization is indeed inadvisable as others have said. However, let me put your question in slightly another light.
I know it helps a lot if we structure our programs using classes, structs, etc., but does it help in terms of running speed to avoid these structures and write the code plainly, in terms of basic C++ syntax?
Theoretically, most properly written C++ code should run just as fast with fully developed classes as without, but
there are exceptions to the rule;
the effort required to write the C++ code theoretically properly may be too great; and
the same features of the C++ compiler that make it hard to write incorrect code can make it all too easy to write grossly inefficient code.
Point-by-point remarks follow.
Consider a complex three-dimensional vector type, of which each instance consists of six doubles (three real parts and three imaginary parts). If there were not so many doubles, your compiler might load them directly into your microprocessor's registers, but with six they are likely to remain on the stack when the complex three-dimensional vector is loaded. Some operations however on a complex three-dimensional vector do not require all six doubles, but only one, two or three of them. If so, then it might be preferable to store the six floating-point components separately. Thus, rather than an array of 1000 vectors, you'd keep six arrays of 1000 doubles each. Of course, one can (and probably should) bind the arrays together in a class of some kind, but -- for efficiency reasons only -- a good design might never explicitly associate individual elements from one array to another.
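A sketch of the two layouts just described (the type names are made up):

// Array-of-structs: all six doubles of one vector sit together.
struct CVec3 { double re[3]; double im[3]; };

// Struct-of-arrays: each component is contiguous across all vectors, so
// an operation touching only the real parts streams through memory
// without ever loading the imaginary parts.
struct CVec3Block {
    double re[3][1000];
    double im[3][1000];
};

double sumFirstRealParts( const CVec3Block& b ) {
    double s = 0;
    for ( int i = 0; i < 1000; ++i )
        s += b.re[0][i];  // one dense, cache-friendly stream
    return s;
}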
Sometimes, you know where your data is and what you want to do with it, and C++'s elaborate organizational and access-control facilities only get in your way. In this case, you might skip the high-level C++ and just do what you want in primitive, hackworthy, brutish, machete-wielding C-style code. Indeed, C++ explicitly supports this style of coding by making it possible -- nay, easy -- to wrap the primitive C code safely within a module and thus to hide its horror from the rest of your beautiful C++ program. Of course, if you hand your code a machete, so to speak, then you take a risk, don't you, because your code may hack up data you never wanted it to, and your compiler will stand aside and let it do it; but sometimes the risk is worth the gain, and sometimes the risk is even fun (and character-building) for a programmer's change of pace.
This point is the most subtle of the three. Where a user-defined type consists partly of other user-defined types, multiple layers of constructors will be implicitly invoked. This is great, and usually it is what you want, especially if you have a good unit-testing regime at each layer. The rose however has a thorn, as it were. A properly written constructor is careful never to lose anything it needs. So, unless the programmer is most careful, a constructor may quietly make a lot of strictly unnecessary copies of very large objects. Sometimes, the programmer will mentally lose track of all the levels of implicit invocation, which he never would have done if he had had to handle each invocation explicitly. Also, your data in an object of one type may lack access to a member function to which it can easily gain access, so long as it temporarily copies itself to an object of another type (you can avoid the copy with handle types, reference counting and so forth, but this is not free: it's quite a bit of work). Even if the programmer is conscious of the implicit copy, the implicit copy is so much easier to code in the moment that the temptation to do so is sometimes too great -- especially when a deadline looms! Several hidden inefficiencies can arise in these ways. One can, and should, work around such inefficiencies, of course, but it can take a lot of coding effort to do so and, even then, your compiler is so busy helping you to avoid logical errors that it tends to cause you to create inadvertent inefficiencies that you would never purposely have created. The unnecessary, hidden copying of data is a much bigger problem in C++ than it ever was in C.
All in all, I would say that the C++ trade-off is worth it 80 percent of the time. C++'s organizational and access-control facilities merit the effort it takes to apply them properly. If your question regards the 20 percent, well, there is more than one valid approach to programming, in my view. Sometimes it really does help "that we avoid these structures and write code plain in terms of basic C++ syntax," as you have said.
Usually, no. Sometimes, yes. I think that the earlier answers are right, though, that the particular example you have posed is probably better treated in boring, neat, orderly C++, without tricks.
Two things:
DO NOT do any kind of premature optimization. CPUs are fast nowadays, and compilers are smart, able to figure out optimizations that you wouldn't think of in months of looking at your code.
You can easily check for things like memory leaks by profiling your code and/or using conditional compilation. Leaks shouldn't occur in release versions, so you can simply skip those checks there.
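One way such conditional compilation might look - a sketch with made-up macro names; NDEBUG is defined in release builds, so everything compiles away there:

#include <cassert>
#include <cstddef>

#ifndef NDEBUG
inline std::size_t& liveAllocations() { static std::size_t n = 0; return n; }
#define TRACK_ALLOC()    ( ++liveAllocations() )
#define TRACK_FREE()     ( --liveAllocations() )
#define CHECK_NO_LEAKS() assert( liveAllocations() == 0 )
#else
#define TRACK_ALLOC()    ((void)0)
#define TRACK_FREE()     ((void)0)
#define CHECK_NO_LEAKS() ((void)0)
#endif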

C++ style vs. performance?

C++ style vs. performance - is using C-style things that are faster than some C++ equivalents really bad practice? For example:
Don't use atoi(), itoa(), atol(), etc.! Use std::stringstream <- probably sometimes it's better, but always? What's so bad about using the C functions? Yep, C-style, not C++, but whatever? This is C++; we're looking for performance all the time..
Never use raw pointers, use smart pointers instead - OK, they're really useful, everyone knows that, I know that, I use them all the time and I know how much better they are than raw pointers, but sometimes it's completely safe to use raw pointers.. Why not? "Not C++ style"? <- is this enough?
Don't use bitwise operations - too C-style? WTH? Why not, when you're sure what you're doing? For example - don't do bitwise exchange of variables ( a ^= b; b ^= a; a ^= b; ) - use standard 3-step exchange. Don't use left-shift for multiplying by two. Etc, etc.. (OK, that's not C++ style vs. C-style, but still "not good practice" )
And finally, the most expensive - "Don't use enums to return codes, it's too C-style, use exceptions for different errors"? Why? OK, when we're talking about error handling at deep levels - OK, but why always? What's so wrong, for example, with a function that returns different error codes when the error handling is implemented only in the function that calls it? I mean - no need to pass the error codes up to a higher level. Exceptions are rather slow, and they're exceptions for exceptional situations, not for... beauty.
etc., etc., etc.
Okay, I know that good coding style is very, very important <- the code should be easy to read and understand. I know that there's no need for micro-optimizations, as modern compilers are very smart and compiler optimizations are very powerful. But I also know how expensive exception handling is, how (some) smart pointers are implemented, and that there's no need for smart_ptr all the time.. I know that, for example, atoi is not as "safe" as std::stringstream is, but still.. What about performance?
EDIT: I'm not talking about really hard things that are C-style specific. I mean - don't be surprised by the use of function pointers or virtual methods and that kind of stuff, which a C++ programmer may not know if he's never used such things (while C programmers do this all the time). I'm talking about more common and easy things, such as those in the examples.
In general, the thing you're missing is that the C way often isn't faster. It just looks more like a hack, and people often think hacks are faster.
Never use raw pointers, use smart pointers instead - OK, they're really useful, everyone knows that, I know that, I use them all the time and I know how much better they are than raw pointers, but sometimes it's completely safe to use raw pointers.. Why not?
Let's turn the question on its head. Sometimes it's safe to use raw pointers. Is that alone a reason to use them? Is there anything about raw pointers that is actually superior to smart pointers? It depends. Some smart pointer types are slower than raw pointers. Others aren't. What is the performance rationale for using a raw pointer over a std::unique_ptr or a boost::scoped_ptr? Neither of them has any overhead; they just provide safer semantics.
This isn't to say that you should never use raw pointers. Just that you shouldn't do it just because you think you need performance, or just because "it seems safe". Do it when you need to represent something that smart pointers can't. As a rule of thumb, use pointers to point to things, and smart pointers to take ownership of things. But it's a rule of thumb, not a universal rule. Use whichever fits the task at hand. But don't blindly assume that raw pointers will be faster. And when you use smart pointers, be sure you are familiar with them all. Too many people just use shared_ptr for everything, and that is just awful, both in terms of performance and the very vague shared ownership semantics you end up applying to everything.
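A minimal sketch of that rule of thumb (Widget is a made-up type):

#include <memory>

struct Widget { int value = 0; };

// Non-owning access: a raw pointer (or reference) is the right tool.
void observe( Widget* w ) { if ( w ) ++w->value; }

int main() {
    // Owning: unique_ptr costs the same as a raw Widget* but cleans up
    // on every path out of the scope, including exceptions.
    auto owner = std::make_unique<Widget>();
    observe( owner.get() );  // hand out a non-owning view
}   // Widget destroyed here, no delete needed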
Don't use bitwise operations - too C-style? WTH? Why not, when you're sure what you're doing? For example - don't do bitwise exchange of variables ( a ^= b; b ^= a; a ^= b; ) - use standard 3-step exchange. Don't use left-shift for multiplying by two. Etc, etc.. (OK, that's not C++ style vs. C-style, but still "not good practice" )
That one is correct, and the reason is exactly "it's faster": the straightforward version is the faster one. Bitwise exchange is problematic in many ways:
it is slower on a modern CPU
it is more subtle and easier to get wrong
it works with a very limited set of types
And when multiplying by two, multiply by two. The compiler knows about this trick and will apply it if it is faster. Once again, shifting has many of the same problems. It may, in this case, be faster (which is why the compiler will do it for you), but it is still easier to get wrong, and it works with a limited set of types. In particular, it might compile fine with types that you think are safe for this trick... and then blow up in practice. Bit-shifting negative values, especially, is a minefield. Let the compiler navigate it for you.
Incidentally, this has nothing to do with "C style". The exact same advice applies in C. In C, a regular swap is still faster than the bitwise hack, and bitshifting instead of a multiply will still be done by the compiler if it is valid and if it is faster.
But as a programmer, you should use bitwise operations for one thing only: to do bitwise manipulation of integers. You've already got a multiplication operator, so use that when you want to multiply. And you've also got a std::swap function. Use that if you want to swap two values. One of the most important tricks when optimizing is, perhaps surprisingly, to write readable, meaningful code. That allows your compiler to understand the code and optimize it. std::swap can be specialized to do the most efficient exchange for the particular type it's used on. And the compiler knows several ways to implement multiplication, and will pick the fastest one depending on circumstance... If you tell it to. If you tell it to bit shift instead, you're just misleading it. Tell it to multiply, and it will give you the fastest multiply it has.
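In code, that advice is as simple as it sounds:

#include <utility>

int main() {
    int a = 3, b = 5;
    std::swap( a, b );  // states the intent; may be specialized per type

    int x = 7;
    int y = x * 2;      // write the multiply; the compiler emits a shift
                        // by itself when that is faster on the target
    (void)y;
}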
And finally, the most expensive - "Don't use enum-s to return codes, it's too C-style, use exceptions for different errors" ?
Depends on who you ask. Most C++ programmers I know of find room for both. But keep in mind that one unfortunate thing about return codes is that they're easily ignored. If that is unacceptable, then perhaps you should prefer an exception in this case. Another point is that RAII works better together with exceptions, and a C++ programmer should definitely use RAII wherever possible. Unfortunately, because constructors can't return error codes, exceptions are often the only way to indicate errors.
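A sketch of that last point - a constructor has no return value, so it reports failure by throwing, and the destructor (RAII) then guarantees cleanup:

#include <cstdio>
#include <stdexcept>

class File {
    std::FILE* f_;
public:
    explicit File( const char* path ) : f_( std::fopen( path, "r" ) ) {
        if ( !f_ ) throw std::runtime_error( "cannot open file" );
    }
    ~File() { std::fclose( f_ ); }      // runs on every path out of scope
    File( const File& ) = delete;       // sketch: no copying
    File& operator=( const File& ) = delete;
};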
but still.. What about performance?
What about it? Any decent C programmer would be happy to tell you not to optimize prematurely.
Your CPU can execute perhaps 8 billion instructions per second. If you make two calls to a std::stringstream in that second, is that going to make a measurable dent in the budget?
You can't predict performance. You can't make up a coding guideline that will result in fast code. Even if you never throw a single exception, and never ever use stringstream, your code still won't automatically be fast. If you try to optimize while you write the code, then you're going to spend 90% of the effort optimizing the 90% of the code that is hardly ever executed. In order to get a measurable improvement, you need to focus on the 10% of the code that make up 95% of the execution time. Trying to make everything fast just results in a lot of wasted time with little to show for it, and a much uglier code base.
I'd advise against atoi and atol as a rule, but not just on style grounds: they make it essentially impossible to detect input errors. While a stringstream can do the same job, strtol (for one example) is what I'd usually advise as the direct replacement.
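For instance, strtol reports exactly where parsing stopped and whether the value overflowed, which atoi cannot do:

#include <cerrno>
#include <cstdio>
#include <cstdlib>

int main() {
    const char* input = "123abc";
    char* end = nullptr;
    errno = 0;
    long value = std::strtol( input, &end, 10 );

    if ( end == input )         std::puts( "no digits at all" );
    else if ( errno == ERANGE ) std::puts( "value out of range" );
    else if ( *end != '\0' )    std::printf( "parsed %ld, junk follows\n", value );
    else                        std::printf( "clean parse: %ld\n", value );
}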
I'm not sure who's giving that advice. Use smart pointers when they're helpful, but when they're not, there's nothing wrong with using a raw pointer.
I really have no idea who thinks it's "not good practice" to use bitwise operators in C++. Unless there were some specific conditions attached to that advice, I'd say it was just plain wrong.
This one depends heavily on where you draw the line between an exceptional input, and (for example) an input that's expected, but not usable. Generally speaking, if you're accepting input direct from the user, you can't (and shouldn't) classify anything as truly exceptional. The main good point of exceptions (even in a situation like this) is ensuring that errors aren't just ignored. OTOH, I don't think that's always the sole criterion, so you can't say it's the right way to handle every situation.
All in all, it sounds to me like you've gotten some advice that's dogmatic to the point of ignoring reality. It's probably best ignored or at least viewed as one rather extreme position about how C++ could be written, not necessarily how it always (or ever, necessarily) should be written.
Adding to Jerry Coffin's answer, which I think is extremely useful, I would like to present some subjective observations.
The thing is that programmers tend to get fancy. That is, most of us really like writing fancy code just for the sake of it. This is perfectly fine as long as you are doing the project on your own. Remember, good software is software whose binary works as expected, not software whose source code is clean. However, when it comes to larger projects which are developed and maintained by many people, it is economically better to write simpler code, so that no one on the team loses time trying to understand what you meant, even at a (naturally minor) runtime cost. That's why many people, including myself, would discourage using the xor trick instead of assignment (you may be surprised, but there are a great many programmers out there who have never heard of the xor trick). The xor trick works only for integers anyway, and the traditional way of swapping integers is very fast anyway, so using the xor trick is just being fancy.
Using itoa, atoi, etc. instead of streams is faster. Yes, it is. But how much faster? Not much. Unless most of your program does nothing but conversions between text and numbers, you won't notice the difference. Why do people use itoa, atoi, etc.? Well, some do because they are unaware of the C++ alternatives. Others do because it's just one LOC. For the former group - shame on you; for the latter - why not boost::lexical_cast (see the sketch after this list)?
Exceptions... ah... yeah, they can be slower than return codes, but in most cases not really. Return codes can carry information which is not an error. Exceptions should be used to report severe errors, ones which cannot be ignored. Some people forget about this and use exceptions to simulate some weird signal/slot mechanism (believe me, I have seen it; yuck). My personal opinion is that there is nothing wrong with using return codes, but severe errors should be reported with exceptions, unless the profiler has shown that avoiding them would considerably boost performance.
Raw pointers - my own opinion is this: never use smart pointers when it's not about ownership; always use smart pointers when it is about ownership. Naturally, with some exceptions.
Bit-shifting instead of multiplication by powers of two. This, I believe, is a classic example of premature optimization. x << 3; I bet at least 25% of your co-workers will need some time before they realize this means x * 8. Obfuscated (at least for that 25%) code, for what exact reason? Again, if the profiler tells you this is the bottleneck (which will happen only in extremely rare cases), then green light, go ahead and do it (leaving a comment that this in fact means x * 8).
To sum up: a good professional acknowledges the "good styles", understands why and when they are good, and rightfully makes exceptions because he knows what he's doing. Average/bad professionals fall into two types: the first type doesn't acknowledge good style and doesn't even understand what it is or why it matters - fire them. The other type treats style as dogma, which is not always good.
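For the boost::lexical_cast suggestion above, a short sketch - still one line per conversion, but with error detection built in:

#include <boost/lexical_cast.hpp>
#include <iostream>
#include <string>

int main() {
    int n = boost::lexical_cast<int>( "42" );
    std::string s = boost::lexical_cast<std::string>( 3.14 );

    try {
        boost::lexical_cast<int>( "42abc" );  // trailing junk
    } catch ( const boost::bad_lexical_cast& ) {
        std::cout << "not a clean integer\n";
    }
    std::cout << n << ' ' << s << '\n';
}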
What's a best practice? Wikipedia's words are better than mine would be:
A best practice is a technique, method, process, activity, incentive, or reward which conventional wisdom regards as more effective at delivering a particular outcome than any other technique, method, process, etc. when applied to a particular condition or circumstance.
[...]
A given best practice is only applicable to particular condition or circumstance and may have to be modified or adapted for similar circumstances. In addition, a "best" practice can evolve to become better as improvements are discovered.
I believe there is no such thing as a universal truth in programming: if you think that something is a better fit in your situation than a so-called "best practice", then do what you believe is right, but know exactly why you are doing it (i.e., prove it with numbers).
Functions with mutable char* arguments are bad in C++ because it's too difficult to handle their memory manually, and we have alternatives. They aren't generic; we can't easily switch from char to wchar_t the way basic_string allows. Also, lexical_cast is a more direct replacement for atoi and itoa.
If you don't really need the smartness of a smart pointer, don't use it.
To swap, use swap. Use bitwise operations only for bitwise work: checking/setting/inverting flags, etc.
Exceptions are fast. They allow removing error-checking branches, so if they really "never happen", they can even increase performance.
Multiplication by bit-shifting doesn't improve performance in C; the compiler will do that for you. Just make sure you multiply or divide by 2^n values where performance matters.
Bitwise swapping is also something that'll probably just confuse your compiler.
I'm not very experienced with string handling in C++, but from what I know, it's hard to believe it's more flexible than scanf and printf.
Also, these "you should never" statements, I generally regard them as recommendations.
All of your questions are a priori. What I mean is you are asking them in the abstract, not in the context of any specific program whose performance is your concern.
That's like trying to swim without being in water.
If you do tuning on a specific, concrete program, you will find performance problems, and chances are they will have almost nothing whatever to do with these abstract questions. They will most likely all be things you could not have thought of a priori.
For a specific example of this, look here.
If I could generalize from experience, a major source of performance problems is galloping generality.
That is, while data structure abstraction is generally considered a good thing, any good thing can be massively over-used, and then it becomes a crippling bad thing. This is not rare. In my experience it is typical.
I think you're answering big parts of your question on your own. I personally prefer easy-to-read code (even if you understand the C style, maybe the next person to read your code will have more trouble with it) and safe code (which suggests stringstream, exceptions, smart pointers...).
If you really have something where it makes sense to consider bitwise operations - ok. But often I see C programmers use a char instead of a couple of bools. I do NOT like this.
Speed is important, but usually most of the time is spent in a few hotspots of a program. So unless you measure that some technique is a problem (or you know pretty surely that it will become one), I would rather use what you call C++ style.
Why is the expensiveness of exceptions an argument? Exceptions are exceptions because they are rare. Their performance doesn't influence overall performance. The steps you have to take to make your code exception-safe do not influence performance either. But on the other hand, exceptions are convenient and flexible.
This is not really an "answer", but if you work on a project where performance is important (e.g. embedded/games), people usually do things the faster C way instead of the slower C++ way in the areas you described.
The exception may be bitwise operations, where not as much is gained as you might think. For example, "Don't use left-shift for multiplying by two." A half-way decent compiler will generate the same code for x << 1 and x * 2.

Do very long methods always need refactoring?

I face a situation where we have many very long methods, 1000 lines or more.
To give you some more detail: we have a list of incoming high-level commands, and each one results in a longer (sometimes huge) list of lower-level commands. A factory creates an instance of a class for each incoming command. Each class has a process method, where all the lower-level commands are created and added in sequence. As I said, these sequences of commands and their parameters quite often cause the process methods to reach thousands of lines.
There are a lot of repetitions. Many command patterns are shared between different commands, but the code is repeated over and over. That leads me to think refactoring would be a very good idea.
On the other hand, the specs we get come in exactly the same form as the current code: a very long list of commands for each incoming one. When I've tried some refactoring, I've started to feel uncomfortable with the specs. I miss the obvious correspondence between the specs and the code, and lose time digging into newly created common classes.
So here is the question: in general, do you think such very long methods always need refactoring, or in a similar case would it be acceptable?
(unfortunately refactoring the specs is not an option)
edit:
I have removed every reference to "generate" because it was confusing. This is not auto-generated code.
class InCmd001 {
    OutMsg process( InMsg& inMsg ) {
        OutMsg outMsg = OutMsg::Create();

        OutCmd001 outCmd001 = OutCmd001::Create();
        outCmd001.SetA( param.getA() );  // param: presumably a member of the class
        outCmd001.SetB( inMsg.getB() );
        outMsg.addCmd( outCmd001 );

        OutCmd016 outCmd016 = OutCmd016::Create();
        outCmd016.SetF( param.getF() );
        outMsg.addCmd( outCmd016 );

        OutCmd007 outCmd007 = OutCmd007::Create();
        outCmd007.SetR( inMsg.getR() );
        outMsg.addCmd( outCmd007 );

        // ......
        return outMsg;
    }
};
Here is an example of one incoming command class (hand-written, in pseudo-C++).
Code never needs refactoring. The code either works, or it doesn't. And if it works, the code doesn't need anything.
The need for refactoring comes from you, the programmer. The person reading, writing, maintaining and extending the code.
If you have trouble understanding the code, it needs to be refactored. If you would be more productive by cleaning up and refactoring the code, it needs to be refactored.
In general, I'd say it's a good idea for your own sake to refactor 1000+ line functions. But you're not doing it because the code needs it. You're doing it because that makes it easier for you to understand the code, test its correctness, and add new functionality.
On the other hand, if the code is automatically generated by another tool, you'll never need to read it or edit it. So what'd be the point in refactoring it?
I understand exactly where you're coming from, and can see exactly why you've structured your code the way it is, but it needs to change.
The uncertainty you feel when you attempt to refactor can be ameliorated by writing unit tests. If you have tests specific to each spec, then the code for each spec can be refactored until you're blue in the face, and you can have confidence in it.
A second option: is it possible to automatically generate your code from a data structure?
If you have a core suite of classes that do the donkey work and handle the edge cases, you can auto-generate the repetitive 1000-line methods as often as you wish.
However, there are exceptions to every rule.
If the methods are a literal interpretation of the spec (very little additional logic), and the specs change infrequently, and the "common" portions (i.e. bits that happen to be the same right now) of the specs change at different times, and you're not going to be asked to get a 10x performance gain out of the code anytime soon, then (and only then) . . . you may be better off with what you have.
. . . but on the whole, refactor.
Yes, always. 1000 lines is at least 10x longer than any function should ever be, and I'm tempted to say 100x, except that when dealing with input parsing and validation it can become natural to write functions with 20 or so lines.
Edit: Just re-read your question and I'm not clear on one point - are you talking about machine generated code that no-one has to touch? In which case I would leave things as they are.
Refactoring is not the same as writing from scratch. While you should never write code like this, before you refactor it you need to consider the cost of refactoring in terms of time spent, the associated risk of breaking code that already works, and the net benefit in terms of future time saved. Refactor only if the net benefits outweigh the associated costs and risks.
Sometimes wrapping and rewriting can be a safer and more cost effective solution, even if it appears expensive at first glance.
Long methods need refactoring if they are maintained (and thus need to be understood) by humans.
As a rule of thumb, code for humans first. I don't agree with the common idea that functions need to be short. I think what you need to aim at is when a human reads your code they grok it quickly.
To this effect it's a good idea to simplify things as much as possible--but not more than that. It's a good idea to delegate roughly one task for each function. There is no rule as for what "roughly one task" means: you'll have to use your own judgement for that. But do recognize that a function split into too many other functions itself reduces readability. Think about the human being who reads your function for the first time: they would have to follow one function call after another, constantly context-switching and maintaining a stack in their mind. This is a task for machines, not for humans.
Find the balance.
Here, you see how important naming things is. You will see it is not that easy to choose names for variables and functions, it takes time, but on the other hand it can save a lot of confusion on the human reader's side. Again, find the balance between saving your time and the time of the friendly humans who will follow you.
As for repetition, it's a bad idea. It's something that needs to be fixed, just like a memory leak. It's a ticking bomb.
As others have said before me, changing code can be expensive. You need to do the thinking as for whether it will pay off to spend all this time and effort, facing the risks of change, for a better code. You will possibly lose lots of time and make yourself one headache after another now, in order to possibly save lots of time and headache later.
Take a look at the related question How many lines of code is too many?. There are quite a few tidbits of wisdom throughout the answers there.
To repost a quote (although I'll attempt to comment on it a little more here)... A while back, I read this passage from Ovid's journal:
I recently wrote some code for Class::Sniff which would detect "long methods" and report them as a code smell. I even wrote a blog post about how I did this (quelle surprise, eh?). That's when Ben Tilly asked an embarrassingly obvious question: how do I know that long methods are a code smell?
I threw out the usual justifications, but he wouldn't let up. He wanted information and he cited the excellent book Code Complete as a counter-argument. I got down my copy of this book and started reading "How Long Should A Routine Be" (page 175, second edition). The author, Steve McConnell, argues that routines should not be longer than 200 lines. Holy crud! That's waaaaaay too long. If a routine is longer than about 20 or 30 lines, I reckon it's time to break it up.
Regrettably, McConnell has the cheek to cite six separate studies, all of which found that longer routines were not only not correlated with a greater defect rate, but were also often cheaper to develop and easier to comprehend. As a result, the latest version of Class::Sniff on github now documents that longer routines may not be a code smell after all. Ben was right. I was wrong.
(The rest of the post, on TDD, is worth reading as well.)
Coming from the "shorter methods are better" camp, this gave me a lot to think about.
Previously my large methods were generally limited to "I need inlining here, and the compiler is being uncooperative", or "for one reason or another the giant switch block really does run faster than the dispatch table", or "this stuff is only called exactly in sequence and I really really don't want function call overhead here". All relatively rare cases.
In your situation, though, I'd have a large bias toward not touching things: refactoring carries some inherent risk, and it may currently outweigh the reward. (Disclaimer: I'm slightly paranoid; I'm usually the guy who ends up fixing the crashes.)
Consider spending your efforts on tests, asserts, or documentation that can strengthen the existing code and tilt the risk/reward scale before any attempt to refactor: invariant checks, bound function analysis, and pre/postcondition tests; any other useful concepts from DBC; maybe even a parallel implementation in another language (maybe something message oriented like Erlang would give you a better perspective, given your code sample) or even some sort of formal logical representation of the spec you're trying to follow if you have some time to burn.
Any of these kinds of efforts generally have a few results, even if you don't get to refactor the code: you learn something, you increase your (and your organization's) understanding of and ability to use the code and specifications, you might find a few holes that really do need to be filled now, and you become more confident in your ability to make a change with less chance of disastrous consequences.
As you gain a better understanding of the problem domain, you may find that there are different ways to refactor you hadn't thought of previously.
This isn't to say "thou shalt have a full-coverage test suite, and DBC asserts, and a formal logical spec". It's just that you are in a typically imperfect situation, and diversifying a bit -- looking for novel ways to approach the problems you find (maintainability? fuzzy spec? ease of learning the system?) -- may give you a small bit of forward progress and some increased confidence, after which you can take larger steps.
So think less from the "too many lines is a problem" perspective and more from the "this might be a code smell, what problems is it going to cause for us, and is there anything easy and/or rewarding we can do about it?"
Leaving it cooking on the backburner for a bit -- coming back and revisiting it as time and coincidence allows (e.g. "I'm working near the code today, maybe I'll wander over and see if I can't document the assumptions a bit better...") may produce good results. Then again, getting royally ticked off and deciding something must be done about the situation is also effective.
Have I managed to be wishy-washy enough here? My point, I think, is that the code smells, the patterns/antipatterns, the best practices, etc -- they're there to serve you. Experiment to get used to them, and then take what makes sense for your current situation, and leave the rest.
I think you first need to "refactor" the specs. If there are repetitions in the specs, they too will become easier to read if they make use of some basic building blocks.
Edit: As long as you cannot refactor the specs, I wouldn't change the code.
Coding style guides are all made for easier code maintenance, but in your special case the ease of maintenance is achieved by following the spec.
Some people here asked if the code is generated. In my opinion it does not matter: If the code follows the spec "line by line" it makes no difference if the code is generated or hand-written.
1000 lines of code is nothing. We have functions that are 6 to 12 thousand lines long. Of course those functions are so big that things literally get lost in there, and no tool can help us even look at high-level abstractions of them. The code is now, unfortunately, incomprehensible.
My opinion of functions that big is that they were not written by brilliant programmers but by incompetent hacks who shouldn't be left anywhere near a computer - they should be fired and left flipping burgers at McDonald's. Such code wreaks havoc by leaving behind features that cannot be added to or improved upon (too bad for the customer). The code is so brittle that it cannot be modified by anyone - not even the original authors.
And yes, those methods should be refactored, or thrown away.
Do you ever have to read or maintain the generated code?
If yes, then I'd think some refactoring might be in order.
If no, then the higher-level language is really the language you're working with -- the C++ is just an intermediate representation on the way to the compiler -- and refactoring might not be necessary.
Looks to me that you've implemented a separate language within your application - have you considered going that way?
It has been my understanding that it's recommended that any method over 100 lines of code be refactored.
I think some rules may be a little different in this era, when code is most commonly viewed in an IDE. If the code does not contain exploitable repetition, such that there are 1,000 lines which are going to be referenced once each and which share a significant number of variables in a clear fashion, dividing the code into 100-line routines each of which is called once may not be much of an improvement over a well-formatted 1,000-line module which includes #region tags or the equivalent to allow outline-style viewing.
My philosophy is that certain layouts of code generally imply certain things. To my mind, when a piece of code is placed into its own routine, that suggests that the code will be usable in more than one context (exception: callback handlers and the like in languages which don't support anonymous methods). If code segment #1 leaves an object in an obscure state which is only usable by code segment #2, and code segment #2 is only usable on a data object which is left in the state created by #1, then absent some compelling reason to put the segments in different routines, they should appear in the same routine. If a program puts objects through a chain of obscure states extending for many hundreds of lines of code, it might be good to rework the design of the code to subdivide the operation into smaller pieces which have more "natural" pre- and post- conditions, but absent some compelling reason to do so, I would not favor splitting up the code without changing the design.
For further reading, I highly recommend the long, insightful, entertaining, and sometimes bitter discussion of this topic over on the Portland Pattern Repository.
I've seen cases where it is not the case (for example, creating an Excel spreadsheet in .NET often requires a lot of lines of code for formatting the sheet), but most of the time the best thing would indeed be to refactor it.
I personally try to make a function small enough that it all fits on my screen (without hurting readability, of course).
1000 lines? They definitely need to be refactored. Also note that, for example, the default maximum number of executable statements per method is 30 in Checkstyle, a well-known coding standard checker.
If and when you refactor, add some comments to explain what the heck it's doing.
If it had comments, it would be a much less likely candidate for refactoring, because it would already be easier to read and follow for someone starting from scratch.
in general, do you think such very long methods always need refactoring,
If you ask in general, we will say yes.
or in a similar case would it be acceptable? (unfortunately refactoring the specs is not an option)
Sometimes it is acceptable, but that is very unusual. I'll give you a pair of examples:
There is a line of 8-bit microcontrollers called Microchip PIC that has only a fixed 8-level hardware stack, so you can't nest more than 8 calls; care must be taken to avoid stack overflow, and in that special case having many small nested functions is not the best way to go.
The other example is optimizing code at a very low level, where you have to account for the cost of jumps and context saving. Use that with care.
EDIT:
Even with generated code, you might need to refactor the way it's generated - for example, to save memory or energy, to produce human-readable output, for beauty, who knows what else.
There has been very good general advice already; here is a practical recommendation for your sample:
common patterns can be isolated in plain feeder methods:
void AddSimpleTransform( OutMsg& msg, InMsg const& inMsg,
                         int rotateBy, int foldBy, int gonkBy = 0 )
{
    // create & add up to three messages
}
You might even improve on that by making it a member of OutMsg and using a fluent interface, so that you can write
OutMsg msg;
msg.AddSimpleTransform( inMsg, 12, 17 )
   .Staple( "print" )
   .AddArtificialRust( 0.02 );
which can be a further improvement under some circumstances.

Is it a good idea to apply some basic macros to simplify code in a large project?

I've been working on a foundational C++ library for some time now, and there are a variety of ideas I've had that could really simplify the code writing and managing process. One of these is the concept of introducing some macros to simplify statements that appear very often but are more cumbersome than they should be.
For example, I've come up with this basic macro to simplify the most common type of for loop:
#define loop(v,n) for(unsigned long v=0; v<n; ++v)
This would enable you to replace those clunky for loops you see so much of:
for (int i = 0; i < max_things; i++)
With something much easier to write, and even slightly more efficient:
loop (i, max_things)
Is it a good idea to use conventions like this? Are there any problems you might run into with different types of compilers? Would it just be too confusing for someone unfamiliar with the macro(s)?
IMHO this is generally a bad idea. You are essentially changing well known and understood syntax to something of your own invention. Before long you may find that you have re-invented the language. :)
No, not a good idea.
int max = 23;
loop(i, ++max)...
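To see why, expand the macro by hand (the substitution is purely mechanical):

// loop(i, ++max) expands to:
for (unsigned long i = 0; i < ++max; ++i) ...
// 'max' grows by one at every iteration's test, so 'i' never catches up
// and the loop never terminates - nothing like what the caller meant.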
It is, however, a good idea to refactor commonly used code into reusable components and then reuse instead of copy. You should do this through writing functions similar to the standard algorithms like std::find(). For instance:
template < typename Function >
void loop( size_t count, Function f )
{
    for ( size_t i = 0; i < count; ++i ) f();
}
This is a much safer approach:
int max = 23;
loop(++max, boost::bind(....));
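(With C++11 or later, a lambda does the same job without Boost:)

loop( 5, []{ /* body runs five times */ } );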
I think you've provided one strong argument against this macro with your example usage. You changed the loop iterator type from int to unsigned long. That has nothing to do with how much typing you want to do, so why change it?
That cumbersome for loop specifies the start value, end value, type and name of the iterator. Even if we assume the final part will always be ++name, and we're happy to stick to that, you have two choices - remove some of the flexibility or type it all out every time. You've opted to remove flexibility, but you also seem to be using that flexibility in your code base.
I would say it depends upon whether you expect anyone else to ever have to make sense of your code. If it's only ever going to be you in there, then I don't see a problem with the macros.
If anyone else is ever going to have to look at this code, then the macros are going to cause problems. The other person won't know what they are or what they do (no matter how readable and obvious they seem to you) and will have to go hunting for them when they first run across them. The result will be to make your code unreadable to anyone but yourself - anyone using it will essentially have to learn a new language and program at the same time.
And since the chances of it being just you dealing with the code are pretty much nil if you hope the code will be a library used by more than just yourself - I'd go with don't.
In Unix, I find that by the time I want to create an alias for a command I use all the time, the command is on my fingers, and I'd have a harder time remembering the syntax of my alias than the original command.
The same applies here - by the time you use an idiom so much that you want to create a macro for it, the idiom will be on your fingers, and the macro will cause you more pain than just typing out the code.
Getting rid of the for loops is generally a good idea -- but replacing them with macros is not. I'd take a long, hard look at the standard library algorithms instead.
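For example, most index-based loops can be replaced with a range-based for or a standard algorithm, with no macro in sight:

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{ 1, 2, 3 };

    for ( int x : v )                 // range-based for: no index at all
        std::cout << x << ' ';

    std::for_each( v.begin(), v.end(),
                   []( int x ) { std::cout << x << ' '; } );  // algorithm + lambda
    std::cout << '\n';
}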
Apart from the maintenance/comprehension problems mentioned by others, you'll also have a hard time breaking into and single-stepping through macro code.
One area where I think macros might be acceptable is populating large data structures with constants/literals (when it can save an excessive amount of typing). You normally would not single-step through such code.
Steve Jessop makes a good point. Macros have their uses. If I may expound upon his statements, I would go so far as to say that the argument for or against macros comes down to "it depends". If you make your macros without careful thought, you risk making future maintainers' lives harder. On the other hand, using the wxWidgets library requires using library-provided macros to connect your code with the GUI library. In this case, the macros lower the barrier of entry for using the library: magic whose innards are irrelevant to working with the library is hidden away from the user. The user is saved from having to understand things they really don't need to know about, and it can be argued that this is a "good" use of macros. Also, wxWidgets clearly documents how these macros are supposed to be used. So make sure that what you hide isn't something that will need to be understood by someone else coming in.
Or, if it's just for your own use, knock yourself out.
It's a question of where you're getting your value. Is typing those 15 extra characters in your loops really what's slowing your development down? Probably not. If you've got multiple lines of confusing, unavoidable boilerplate popping up all over the place, then you can and should look for ways to avoid repeating yourself, such as creating useful functions, cleaning up your class hierarchies, or using templates.
But the same optimization rules apply to writing code as to running it: optimizing small things with little effect is not really a good use of time or energy.