C++ refactoring: conditional expansion and block elimination - c++

I'm in the process of refactoring a very large amount of code, mostly C++, to remove a number of temporary configuration checks which have become permanently set to given values. So for example, I would have the following code:
#include <value1.h>
#include <value2.h>
#include <value3.h>
...
if ( value1() )
{
// do something
}
bool b = value2();
if ( b && anotherCondition )
{
// do more stuff
}
if ( value3() < 10 )
{
// more stuff again
}
where the value calls return either a bool or an int. Since I know the values that these calls always return, I've done some regex substitution to expand the calls to their known values:
// where:
// value1() == true
// value2() == false
// value3() == 4
// TODO: Remove expanded config (value1)
if ( true )
{
// do something
}
// TODO: Remove expanded config (value2)
bool b = false;
if ( b && anotherCondition )
{
// do more stuff
}
// TODO: Remove expanded config (value3)
if ( 4 < 10 )
{
// more stuff again
}
Note that although the values are fixed, they are not set at compile time but are read from shared memory so the compiler is not currently optimising anything away behind the scenes.
Although the resultant code looks a bit goofy, this regex approach achieves a lot of what I want: it's simple to apply, it removes dependence on the calls, and it doesn't change the behaviour of the code. It's also likely that the compiler will then optimise a lot of it out, knowing that a block can never be reached or that a check will always return true. It also makes it reasonably easy (especially when diffing against version control) to see what has changed and take the final step of cleaning it up, so that the code above eventually looks as follows:
// do something
// DONT do more stuff (b being false always prevented this)
// more stuff again
The trouble is that I have hundreds (possibly thousands) of changes to make to get from the second, correct-but-goofy stage to the final cleaned code.
I wondered if anyone knew of a refactoring tool which might handle this, or of any techniques I could apply. The main problem is that the C++ syntax makes full expansion or elimination quite difficult to achieve, and there are many permutations of the code above. I feel I almost need a compiler to deal with the variation of syntax that I would need to cover.
I know there have been similar questions, but I can't find any requirement quite like this, and I also wondered whether any tools or procedures have emerged since they were asked.

It sounds like you have what I call "zombie code"... dead in practice, but still live as far as the compiler is concerned. This is a pretty common issue with most systems of organized runtime configuration variables: eventually some configuration variables arrive at a permanent fixed state, yet are reevaluated at runtime repeatedly.
The cure isn't regex, as you have noted, because regex doesn't parse C++ code reliably.
What you need is a program transformation system. This is a tool that really parses source code, and can apply a set of code-to-code rewriting rules to the parse tree, and can regenerate source text from the changed tree.
I understand that Clang has some capability here; it can parse C++ and build a tree, but it does not have source-to-source transformation capability. You can simulate that capability by writing AST-to-AST transformations but that's a lot more inconvenient IMHO. I believe it can regenerate C++ code but I don't know if it will preserve comments or preprocessor directives.
Our DMS Software Reengineering Toolkit with its C++(11) front end can (and has been used to) carry out massive transformations on C++ source code, and has source-to-source transformations. AFAIK, it is the only production tool that can do this. What you need is a set of transformations that represent your knowledge of the final state of the configuration variables of interest, and some straightforward code simplification rules. The following DMS rules are close to what you likely want:
rule fix_value1():expression->expression
"value1()" -> "true";
rule fix_value2():expression->expression
"value2()" -> "false";
rule fix_value3():expression->expression
"value3()" -> "4";
rule simplify_boolean_and_true(r:relation):condition->condition
"r && true" -> "r";
rule simplify_boolean_or_true(r:relation):condition->condition
"r || true" -> "true";
rule simplify_boolean_and_false(r:relation):condition->condition
"r && false" -> "false";
...
rule simplify_boolean_not_true(r:relation):condition->condition
"!true" -> "false";
...
rule simplify_if_then_false(s:statement): statement->statement
" if (false) \s" -> ";";
rule simplify_if_then_true(s:statement): statement->statement
" if (true) \s" -> "\s";
rule simplify_if_then_else_false(s1:statement, s2:statement): statement->statement
" if (false) \s1 else \s2" -> "\s2";
rule simplify_if_then_else_true(s1:statement, s2: statement): statement->statement
" if (true) \s1 else \s2" -> "\s1";
You also need rules to simplify ("fold") constant expressions involving arithmetic, and rules to handle switch on expressions that are now constant. To see what DMS rules look like for integer constant folding see Algebra as a DMS domain.
Unlike regexes, DMS rewrite rules cannot "mismatch" code; they represent the corresponding ASTs, and it is those ASTs that are matched. Because it is AST matching, they have no problems with whitespace, line breaks or comments. You might think they could have trouble with order of operands ('what if "false && x" is encountered?'); they do not, as the grammar rules for && and || are marked in the DMS C++ parser as associative and commutative, and the matching process automatically takes that into account.
What these rules cannot do by themselves is value (in your case, constant) propagation across assignments. For this you need flow analysis so that you can trace such assignments ("reaching definitions"). Obviously, if you don't have such assignments, or very few, you can hand patch those. If you do, you'll need the flow analysis; alas, DMS's C++ front end isn't quite there, but we are working on it; we have control flow analysis in place. (DMS's C front end has full flow analysis.)
(EDIT February 2015: Now does full C++14; flow analysis within functions/methods).
We actually applied this technique to 1.5M SLOC application of mixed C and C++ code from IBM Tivoli almost a decade ago with excellent success; we didn't need the flow analysis :-}

You say:
Note that although the values are reasonably fixed, they are not set at compile time but are read from shared memory so the compiler is not currently optimising anything away behind the scenes.
Constant-folding the values by hand doesn't make a lot of sense unless they are completely fixed. If your compiler provides constexpr you could use that, or you could substitute in preprocessor macros like this:
#define value1() true
#define value2() false
#define value3() 4
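If the values really are frozen, a minimal sketch of the constexpr route mentioned above (C++11; the function names are the ones from the question, now simply returning the fixed values):
constexpr bool value1() { return true; }
constexpr bool value2() { return false; }
constexpr int  value3() { return 4; }
// With these definitions, "if ( value1() )" and "if ( value3() < 10 )"
// fold away at compile time, just like the macro versions.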
The optimizer would take care of you from there. Without seeing examples of exactly what's in your <valueX.h> headers or knowing how your process of getting these values from shared memory is working, I'll just throw out that it could be useful to rename the existing valueX() functions and do a runtime check in case they change again in the future:
#include <cassert>

// Call this at startup to make sure our agreed-on values haven't changed.
void check_values() {
    assert(value1() == get_value1_from_shared_memory());
    assert(value2() == get_value2_from_shared_memory());
    assert(value3() == get_value3_from_shared_memory());
}

Related

Generating branch instructions from an AST for a conditional expression

I am trying to write a compiler for a domain-specific language, targeting a stack-machine based VM that is NOT a JVM.
I have already generated a parser for my language, and can readily produce an AST which I can easily walk. I also have had no problem converting many of the statements of my language into the appropriate instructions for this VM, but am facing an obstacle when it comes to generating appropriate branching instructions when complex conditionals are encountered, especially when they are combined with (possibly nested) 'and'-like or 'or'-like operations which should use short-circuiting branching as applicable.
I am not asking anyone to write this for me. I know that I have not begun to describe my problem in sufficient detail for that. What I am asking for is pointers to useful material that can get me past this hurdle. As I said, I am already past the point of converting about 90% of the statements in my language into applicable instructions, but it is the handling of conditionals and generating the appropriate flow-control instructions that has me stumped. Much of the info I have been able to find so far on generating code from an AST only deals with code for simple imperative-like statements; material on handling conditionals and flow control appears to be much scarcer.
Other than the short-circuiting/lazy-evaluation mechanism for 'and' and 'or' like constructs that I have described, I am not concerned with handling any other optimizations.
Every conditional control flow can be modelled as a flow chart (or flow graph) in which the two branches of the conditional have different targets. Given that boolean operators short-circuit, they are control flow elements rather than simple expressions, and they need to be modelled as such.
One way to think about this is to rephrase boolean operators as instances of the ternary conditional operator. So, for example, A and B becomes A ? B : false and A or B becomes A ? true : B [Note 1]. Note that every control flow diagram has precisely two output points.
To combine boolean expressions, just substitute into the diagram. For example, A AND (B OR C) becomes A ? (B ? true : C) : false: the flow graph for B OR C is spliced into the true out-flow of the test of A.
You implement NOT by simply swapping the meaning of the two out-flows.
If the eventual use of the boolean expression is some kind of conditional, such as an if statement or a conditional loop, you can use the control flow as is. If the boolean expression is to be saved into a variable or otherwise used as a value, you need to fill in the two outflows with code to create the relevant constant, usually a true or false boolean constant, or (in C-like languages) a 1 or 0.
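Here is a minimal, runnable sketch of that scheme in C++. The Emitter and its textual "instructions" are stand-ins I made up, not your VM's real API; the point is that each boolean node is compiled against an explicit pair of (true, false) branch targets, so short-circuiting falls out of the control flow:
#include <cstdio>

struct Emitter {
    int next = 0;
    int  newLabel()                 { return next++; }
    void placeLabel(int l)          { std::printf("L%d:\n", l); }
    void emitOperand(const char* n) { std::printf("  push %s\n", n); }
    void emitBranchIfFalse(int to)  { std::printf("  brfalse L%d\n", to); }
    void emitBranch(int to)         { std::printf("  br L%d\n", to); }
};

enum class Kind { Leaf, And, Or, Not };

struct BoolExpr {
    Kind kind;
    const char* name = nullptr;    // leaf operand name
    const BoolExpr* lhs = nullptr;
    const BoolExpr* rhs = nullptr;
};

// Compile 'e' so control transfers to trueTo/falseTo; no boolean value
// is ever materialized, which is exactly the flow-graph view above.
void emitCond(const BoolExpr& e, Emitter& em, int trueTo, int falseTo) {
    switch (e.kind) {
    case Kind::Leaf:
        em.emitOperand(e.name);                 // evaluate operand once
        em.emitBranchIfFalse(falseTo);
        em.emitBranch(trueTo);
        break;
    case Kind::And: {                           // A && B  ==  A ? B : false
        int mid = em.newLabel();
        emitCond(*e.lhs, em, mid, falseTo);     // false short-circuits
        em.placeLabel(mid);
        emitCond(*e.rhs, em, trueTo, falseTo);
        break;
    }
    case Kind::Or: {                            // A || B  ==  A ? true : B
        int mid = em.newLabel();
        emitCond(*e.lhs, em, trueTo, mid);      // true short-circuits
        em.placeLabel(mid);
        emitCond(*e.rhs, em, trueTo, falseTo);
        break;
    }
    case Kind::Not:                             // !A: swap the out-flows
        emitCond(*e.lhs, em, falseTo, trueTo);
        break;
    }
}

int main() {
    // A && (B || C)
    BoolExpr a{Kind::Leaf, "A"}, b{Kind::Leaf, "B"}, c{Kind::Leaf, "C"};
    BoolExpr bOrC{Kind::Or, nullptr, &b, &c};
    BoolExpr expr{Kind::And, nullptr, &a, &bOrC};

    Emitter em;
    int t = em.newLabel(), f = em.newLabel();
    emitCond(expr, em, t, f);
    em.placeLabel(t);   // "condition true" arm would start here
    em.placeLabel(f);   // "condition false" arm (a real generator would
                        //  branch over the true arm first)
}
Tracing the output for A && (B || C) shows B's test is only reached when A is true, and C's only when B is false, which is exactly the short-circuit behaviour you want.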
Notes:
Another way to write this equivalence is A and B ⇒ A ? B : A; A or B ⇒ A ? A : B, but that is less useful for a control flow view, and also clouds the fact that the intent is to only evaluate each expression once. This form (modified to reuse the initial computation of A) is commonly used in languages with multiple "falsey" values (like Python).

Writing a Parser for a programming language: Output [closed]

I'm trying to write a simple interpreted programming language in C++. I've read that a lot of people use tools such as Lex/Flex and Bison to avoid "reinventing the wheel", but since my goal is to understand how these little beasts work and improve my knowledge, I've decided to write the lexer and the parser from scratch. At the moment I'm working on the parser (the lexer is complete) and I was asking myself what its output should be. A tree? A linear vector of statements with a "depth" or "shift" parameter? How should I manage loops and if statements? Should I replace them with invisible goto statements?
A parser should almost always output an AST. An AST is simply, in the broadest sense, a tree representation of the syntactical structure of the program. A Function becomes an AST node containing the AST of the function body. An if becomes an AST node containing the AST of the condition and the body. A use of an operator becomes an AST node containing the AST of each operand. Integer literals, variable names, and so on become leaf AST nodes. Operator precedence and such is implicit in the relationship of the nodes: Both 1 * 2 + 3 and (1 * 2) + 3 are represented as Add(Mul(Int(1), Int(2)), Int(3)).
Many details of what's in the AST depend on your language (obviously) and what you want to do with the tree. If you want to analyze and transform the program (i.e. split out altered source code at the end), you might preserve comments. If you want detailed error messages, you might add source locations (as in, this integer literal was on line 5 column 12).
A compiler will proceed to turn the AST into a different format (e.g. a linear IR with gotos, or data flow graphs). Going through the AST is still a good idea, because a well-designed AST has a good balance of being syntax-oriented but only storing what's important for understanding the program. The parser can focus on parsing while the later transformations are protected from irrelevant details such as the amount of white space and operator precedence. Note that such a "compiler" might also output bytecode that's later interpreted (the reference implementation of Python does this).
A relatively pure interpreter might instead interpret the AST. Much has been written about this; it is about the easiest way to execute the parser's output. This strategy benefits from the AST in much the same way as a compiler; in particular most interpretation is simply top-down traversal of the AST.
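To make that concrete, here is a tiny hand-rolled sketch (the names are mine, not from any particular project): the AST for 1 * 2 + 3, i.e. Add(Mul(Int(1), Int(2)), Int(3)), plus the top-down traversal a pure AST interpreter performs:
#include <iostream>
#include <memory>

// Minimal expression AST: leaves are integer literals, interior nodes
// are binary operators; precedence lives in the tree shape, not here.
struct Expr {
    virtual ~Expr() = default;
    virtual int eval() const = 0;  // pure tree-walking interpretation
};

struct Int : Expr {
    int value;
    explicit Int(int v) : value(v) {}
    int eval() const override { return value; }
};

struct BinOp : Expr {
    char op;                       // '+' or '*' in this sketch
    std::unique_ptr<Expr> lhs, rhs;
    BinOp(char o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    int eval() const override {
        int a = lhs->eval(), b = rhs->eval();  // top-down traversal
        return op == '+' ? a + b : a * b;
    }
};

int main() {
    // 1 * 2 + 3  ==  Add(Mul(Int(1), Int(2)), Int(3))
    auto ast = std::make_unique<BinOp>('+',
        std::make_unique<BinOp>('*', std::make_unique<Int>(1),
                                     std::make_unique<Int>(2)),
        std::make_unique<Int>(3));
    std::cout << ast->eval() << "\n";  // prints 5
}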
The formal and most properly correct answer is going to be that you should return an Abstract Syntax Tree. But that is simultaneously the tip of an iceberg and no answer at all.
An AST is simply a structure of nodes describing the parse; a visualization of the paths your parse took thru the token/state machine.
Each node represents a path or description. For example, you would have nodes which represents language statements, nodes which represent compiler directives and nodes which represent data.
Consider a node which describes a variable, and let's say your language supports variables of int and string and the notion of "const". You may well choose to make the type a direct property of the Variable node struct/class, but typically in an AST you make properties - like constness - a "mutator", which is itself some form of node linked to the Variable node.
You could implement the C++ concept of "scope" by having locally-scoped variables as mutations of a BlockStatement node; the constraints of a "Loop" node (for, do, while, etc) as mutators.
When you closely tie your parser/tokenizer to your language implementation, it can become a nightmare making even small changes.
While this is true, if you actually want to understand how these things work, it is worth going through at least one first implementation where you begin to implement your runtime system (vm, interpreter, etc) and have your parser target it directly. (The alternative is, e.g., to buy a copy of the "Dragon Book" and read how it's supposed to be done, but it sounds like you are actually wanting to have the full understanding that comes from having worked thru the problem yourself).
The trouble with being told to return an AST is that an AST actually needs a form of parsing.
#include <string> // std::string for m_file

struct Node
{
    enum class Type {
        Variable,
        Condition,
        Statement,
        Mutator,
    };
    Node* m_parent;
    Node* m_next;
    Node* m_child;
    Type m_type;
    std::string m_file;
    std::size_t m_lineNo;
};

struct VariableMutatorNode : public Node
{
    enum class Mutation {
        Const
    };
    Mutation m_mutation;
    // ...
};

struct VariableNode : public Node // note: derives from Node like the others
{
    VariableMutatorNode* m_mutators;
    // ...
};

Node* ast; // Top-level node of the AST.
This sort of AST is probably OK for a compiler that is independent of its runtime, but you'd need to tighten it up a lot for a complex, performance-sensitive language down the road (at which point there is less 'A' in 'AST').
The way you walk this tree is to start with the first node of 'ast' and act according to it. If you're writing in C++, you can do this by attaching behaviors to each node type. But again, that's not so "abstract", is it?
Alternatively, you have to write something which works its way thru the tree.
switch (node->m_type) {
case Node::Type::Variable:
declareVariable(node);
break;
case Node::Type::Condition:
evaluate(node);
break;
case Node::Type::Statement:
execute(node);
break;
}
And as you write this, you'll find yourself thinking "wait, why didn't the parser do this for me?" because processing an AST often feels a lot like you did a crap job of implementing the AST :)
There are times when you can skip the AST and go straight to some form of final representation, and (rare) times when that is desirable; then there are times when you could go straight to some form of final representation but now you have to change the language and that decision will cost you a lot of reimplementation and headaches.
This is also generally the meat of building your compiler - the lexer and parser are generally the lesser parts of such an undertaking. Working with the abstract/post-parse representation is a much more significant part of the work.
That's why people often go straight to flex/bison or antlr or some such.
And if that's what you want to do, looking at .NET or LLVM/Clang can be a good option, but you can also fairly easily bootstrap yourself with something like this: http://gnuu.org/2009/09/18/writing-your-own-toy-compiler/4/
Best of luck :)
I would build a tree of statements. After that, yes, the goto statements are how the majority of it works (jumps and calls). Are you translating to something low-level like assembly?
The output of the parser should be an abstract syntax tree, unless you know enough about writing compilers to directly produce byte-code, if that's your target language. It can be done in one pass but you need to know what you're doing. The AST expresses loops and ifs directly: you're not concerned with translating them yet. That comes under code generation.
People don't use lex/yacc to avoid re-inventing the wheel; they use them to build a more robust compiler prototype more quickly, with less effort, and to focus on the language, avoiding getting bogged down in other details. From personal experience with several VM projects, compilers and assemblers, I suggest if you want to learn how to build a language, do just that -- focus on building a language (first).
Don't get distracted with:
Writing your own VM or runtime
Writing your own parser generator
Writing your own intermediate language or assembler
You can do these later.
This is a common thing I see when a bright young computer scientist first catches the "language fever" (and it's a good thing to catch), but you need to be careful and focus your energy on the one thing you want to do well, and make use of other robust, mature technologies like parser generators, lexers, and runtime platforms. You can always circle back later, when you have slain the compiler dragon first.
Just spend your energy learning how a LALR grammar works, write your language grammar in Bison or Yacc++ if you can still find it, don't get distracted by people who say you should be using ANTLR or whatever else, that isn't the goal early on. Early on, you need to focus on crafting your language, removing ambiguities, creating a proper AST (maybe the most important skillset), semantic checking, symbol resolution, type resolution, type inference, implicit casting, tree rewriting, and of course, end program generation. There is enough to be done making a proper language that you don't need to be learning multiple other areas of research that some people spend their whole careers mastering.
I recommend you target an existing runtime like the CLR (.NET). It is one of the best runtimes for crafting a hobby language. Get your project off the ground using a textual output to IL, and assemble with ilasm. ilasm is relatively easy to debug, assuming you put some time into learning it. Once you get a prototype going, you can then start thinking about other things like an alternate output to your own interpreter, in case you have language features that are too dynamic for the CLR (then look at the DLR). The main point here is that CLR provides a good intermediate representation to output to. Don't listen to anyone that tells you you should be directly outputting bytecode. Text is king for learning in the early stages and allows you to plug and play with different languages / tools. A good book is by the author John Gough, titled Compiling for the .NET Common Language Runtime (CLR) and he takes you through the implementation of the Gardens Point Pascal Compiler, but it isn't a book about Pascal, it is a book about how to build a real compiler on the CLR. It will answer many of your questions on implementing loops and other high level constructs.
Related to this, a great tool for learning is to use Visual Studio and ildasm (the disassembler) and .NET Reflector. All available for free. You can write small code samples, compile them, then disassemble them to see how they map to a stack based IL.
If you aren't interested in the CLR for whatever reason, there are other options out there. You will probably run across llvm, Mono, NekoVM, and Parrot (all good things to learn) in your searches. I was an original Parrot VM / Perl 6 developer, and wrote the Perl Intermediate Representation language and imcc compiler (which is quite a terrible piece of code I might add) and the first prototype Perl 6 compiler. I suggest you stay away from Parrot and stick with something easier like .NET CLR, you'll get much further. If, however, you want to build a real dynamic language, and want to use Parrot for its continuations and other dynamic features, see the O'Reilly Books Perl and Parrot Essentials (there are several editions), the chapters on PIR/IMCC are about my stuff, and are useful. If your language isn't dynamic, then stay far away from Parrot.
If you are bent on writing your own VM, let me suggest you prototype the VM in Perl, Python or Ruby. I have done this a couple of times with success. It allows you to avoid too much implementation early, until your language starts to mature. Perl+Regex are easy to tweak. An intermediate language assembler in Perl or Python takes a few days to write. Later, you can rewrite the 2nd version in C++ if you still feel like it.
All this I can sum up with: avoid premature optimizations, and avoid trying to do everything at once.
First you need to get a good book. So I refer you to the book by John Gough in my other answer, but emphasize, focus on learning to implement an AST for a single, existing platform first. It will help you learn about AST implementation.
How to implement a loop?
Your language parser should return a tree node during the reduce step for the WHILE statement. You might name your AST class WhileStatement, and the WhileStatement has, as members, ConditionExpression and BlockStatement and several labels (also inheritable but I added inline for clarity).
Grammar pseudocode below shows how the reduce step creates a new object of WhileStatement in a typical shift-reduce parser reduction.
How does a shift-reduce parser work?
WHILE ( ConditionExpression )
BlockStatement
{
$$ = new WhileStatement($3, $5);
statementList.Add($$); // this is your statement list (AST nodes), not the parse stack
}
;
As your parser sees "WHILE", it shifts the token on the stack. And so forth.
parseStack.push(WHILE);
parseStack.push('(');
parseStack.push(ConditionalExpression);
parseStack.push(')');
parseStack.push(BlockStatement);
The instance of WhileStatement is a node in a linear statement list. So behind the scenes, the "$$ =" represents a parse reduce (though if you want to be pedantic, $$ = ... is user-code, and the parser is doing its own reductions implicitly, regardless). The reduce can be thought of as popping off the tokens on the right side of the production, and replacing with the single token on the left side, reducing the stack:
// shift-reduce
parseStack.pop_n(5); // pop off the top 5 tokens ($1 = WHILE, $2 = (, $3 = ConditionExpression, etc.)
parseStack.push(currToken); // replace with the current $$ token
You still need to add your own code to add statements to a linked list, with something like "statements.add(whileStatement)" so you can traverse this later. The parser has no such data structure, and its stacks are only transient.
During parse, synthesize a WhileStatement instance with its appropriate members.
In a later phase, implement the visitor pattern to visit each statement and resolve symbols and generate code. So a while loop might be implemented with the following AST C++ class:
class WhileStatement : public CompoundStatement {
public:
ConditionExpression * condExpression; // this is the conditional check
Label * startLabel; // Label can simply be a Symbol
Label * redoLabel; // Label can simply be a Symbol
Label * endLabel; // Label can simply be a Symbol
BlockStatement * loopStatement; // this is the loop code
bool ResolveSymbolsAndTypes();
bool SemanticCheck();
bool Emit(); // emit code
};
Your code generator needs to have a function that generates sequential labels for your assembler. A simple implementation is a function to return a string with a static int that increments, and returns LBL1, LBL2, LBL3, etc. Your labels can be symbols, or you might get fancy with a Label class, and use a constructor for new Labels:
class Label : public Symbol {
public:
    Label() {
        name = newLabel(); // incrementing: LBL1, LBL2, LBL3, ...
    }
};
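A minimal sketch of the newLabel() helper assumed above (any scheme that produces unique names will do):
#include <string>

// Generate sequential assembler labels: LBL1, LBL2, LBL3, ...
std::string newLabel() {
    static int counter = 0; // persists across calls
    return "LBL" + std::to_string(++counter);
}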
A loop is implemented by emitting the redoLabel, then the code for condExpression (branching to the endLabel when the condition is false), then the blockStatement, and at the end of the blockStatement a goto back to the redoLabel.
A sample from one of my compilers to generate code for the CLR.
// Generate code for .NET CLR for While statement
//
void WhileStatement::clr_emit(AST *ctx)
{
redoLabel = compiler->mkLabelSym();
startLabel = compiler->mkLabelSym();
endLabel = compiler->mkLabelSym();
// Emit the redo label which is the beginning of each loop
compiler->out("%s:\n", redoLabel->getName());
if(condExpr) {
condExpr->clr_emit_handle();
condExpr->clr_emit_fetch(this, t_bool);
// Test the condition, if false, branch to endLabel, else fall through
compiler->out("brfalse %s\n", endLabel->getName());
}
// The body of the loop
compiler->out("%s:\n", startLabel->getName()); // start label only for clarity
loopStmt->clr_emit(this); // generate code for the block
// Jump back to the top of the loop
compiler->out("br %s\n", redoLabel->getName()); // goto redoLabel
// End label: first instruction after the loop
compiler->out("%s:\n", endLabel->getName());
}
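For reference, assuming mkLabelSym hands out LBL1, LBL2, LBL3 in the order they are created above, the emitted text has this shape (condition and body code elided):
LBL1:              // redoLabel: top of every iteration
  ...condition code...
  brfalse LBL3     // exit when the condition is false
LBL2:              // startLabel (for clarity only)
  ...loop body code...
  br LBL1          // back to the top
LBL3:              // endLabel: first instruction after the loop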

How can I convert if to switch? [closed]

Is there a way to convert C++ ifs to switches? Both nested or non-nested ifs:
if ( a == 1 ) { b = 1; } /*else*/
if ( a == 2 ) { b = 7; } /*else*/
if ( a == 3 ) { b = 3; }
It should, of course, detect if the conversion is valid or not.
This isn't a micro-optimization, it's for clarity purposes.
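For concreteness, the switch being asked for would be:
switch (a) {
    case 1: b = 1; break;
    case 2: b = 7; break;
    case 3: b = 3; break;
}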
In general, you ought to be able to do this with a program transformation tool that can process C++.
In specific, our DMS Software Reengineering Toolkit can do this, using source-to-source rewrites. I think these are close to doing the job:
default domain Cpp;
rule fold_if_to_switch_initial(s: stmt_sequence, e: expression,
k1: integer_literal, k2: integer_literal,
a1: block, a2: block):
stmt_sequence->stmt_sequence
= "\s
if (\e==\k1) \a1
if (\e==\k2) \a2 "
-> "\s
switch (\e) {
case \k1: \a1
break;
case \k2: \a2
break;
}" if no_side_effects(e) /\ no_impact_on(a1,e);
rule fold_if_to_switch_expand(s: stmt_sequence, e: expression, c: casebody,
k: integer_literal, a: action):
stmt_sequence->stmt_sequence
= "\s
switch (\e) { \c }
if (\e==\k) \a "
-> "\s
switch (\e) {
\c
case \k: \a
break;
}" if no_side_effects(e) /\ no_impact_on(c,e);
One has to handle the cases where the if controls a statement rather than a block, cheesily accomplished by this:
rule blockize_if_statements(e: expression, s: statement):
statement->statement
= "if (\e) \s" -> "if (\e) { \s } ";
DMS works by using a full language parser (yes, it really has a full C++ parser option with preprocessor) to process the source code and the rewrite rules. It isn't fooled by whitespace or comments, because it is operating on the parse trees. The text inside "..." is C++ text (that's what the 'default domain Cpp' declaration says) with pattern variables \x; text outside the "..." is DMS's rule meta-syntax, not C++. Patterns match the trees and bind pattern variables to subtrees; rule right-hand sides are instantiated using the pattern variable bindings. After applying the transformations, it regenerates the (revised) source code from the revised trees.
OP insists, "It should, of course, detect if the conversion is valid or not.". This is the point of the conditions at the end of the rules, e.g., 'if no_side_effect ... no_impact', which are intended to check that evaluating the expression doesn't change something, and that the actions don't affect the value of the expression.
Implementing these semantic checks is difficult, as one must take into account C++'s incredibly complex semantics. A tool doing this must know at least as much as the compiler (e.g., names, types, properties of declarations, reads and writes), and then has to be able to reason about the consequences. DMS at present is able to implement only some of these (we are working on full control and data flow analysis for C++11 now). I would not expect other refactoring tools to do this part well, either, because of the complex semantics; I think it unlikely that this situation will improve, given the effort it takes to reason about C++.
In OP's case (and this occurs fairly often when using a tool like DMS), one can assert that in fact these semantic predicates don't get violated (then you can replace them with "true" and thus avoid implementing them), or that you only care about special cases (e.g., the action is an assignment to a variable other than one in the expression), at which point the check becomes much simpler; or you can simulate the check by narrowing the acceptable syntax:
rule fold_if_to_switch_expand(s: stmt_sequence, v: identifier, c: casebody,
k: integer_literal, v2: identifier, k2: integer_literal):
stmt_sequence->stmt_sequence
= "\s
switch (\v) { \c }
if (\v==\k) \v2=\k2; "
-> "\s
switch (\v) {
\c
case \k: \v2=\k2;
break;
}" if no_side_effects(c);
Having said all this, if OP has only a few hundred known places that need this improvement, he'll probably get it done faster by biting the bullet and using his editor. If he doesn't know the locations, and/or there are many more instances, DMS would get the job done faster and likely with many fewer mistakes.
(We've done essentially this same refactoring for IBM Enterprise COBOL using DMS, without the deep semantic checks.)
There are other program transformation tools with the right kind of machinery (pattern matching, rewriting) that can do this in theory. None of them to my knowledge make any effective attempt to handle C++.
I'd use a text editor.
Step 1:
put MARK before your block of code, with possible leading tabs
Step 2:
Put ENDMARK at the end of your block of code, with possible leading tabs
Step 3:
Load your file in vi of some kind. Run something like this:
/^^T*MARK/
:.,/^^T*ENDMARK/ s/^\(^T*\)if/\1IIFF/g
:%s/IIFF *([^)]*) *{ *\([^}]*\)};? *$/TEST{\1}TSET {\2} break;/
:%s/TEST{\([^=}]*\)==\([^}]*\)}TSET/COND{\1}case \2:/g
?^^T*MARK?
/COND{/
:. s/COND{\([^}]*\)}/switch(\1) {^M^T/
:%s/COND{\([^}]*\)}/^T/g
:%s/^\(^T*\)ENDMARK/\1} \/\/switch/g
:%s/^^TMARK//g
where ^T is tab and ^M is a return character (which you may not be able to type directly, depending on what version of vi/vim you are using)
Code is written off the top of my head, and not tested. But I've done things like this before. Also, files containing TEST{whatever}TSET and COND{whatever} and MARK and ENDMARK and IIFF type constructs will be horribly mangled.
This does not check validity, it just converts things in exactly your format into a switch statement.
Limited validity checking could be done with some fancy-ass (for vi) making sure that the left hand side of the equality expression is the same for all lines. It also doesn't handle else, but that is easily rectified.
Then, before checking in, you do a visual sweep of the modified code using a good comparison tool.

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather than being able to cope with a source fragment.
The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help to clarify and rank requirements. This would work for C or C++ ...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc (-ffunction-sections) to compile so that each function is compiled to an 'independent chunk' within the generated object code file. The linker can pull each 'chunk' into a program. (It is getting pretty late here, so I will look this up in the morning, if you are interested in the idea.)
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positive (changes).
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from a different change set) are the same instructions :-)
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining new lines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}', balanced '(' ')' and all the other valid program text is just individual tokens which can be passed 'straight through'.
Its output might be a separate file/function to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then a parser is mostly just looking for the function name (an identifier) at file scope followed by ( parameter-list ) { ... code ... }
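As a sketch of how dumb that scanner can be, here is a brace-depth tracker (hypothetical code; it ignores comments, strings and the preprocessor, which a real version must at least skip over):
#include <iostream>
#include <string>

// Report the extent of every top-level '{ ... }' block; at file scope in
// C these are (almost always) function bodies. Only brace depth is tracked.
void findTopLevelBlocks(const std::string& src) {
    int depth = 0;
    std::size_t start = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '{') {
            if (depth++ == 0) start = i;       // entering a file-scope block
        } else if (src[i] == '}') {
            if (depth > 0 && --depth == 0)     // leaving it again
                std::cout << "block at [" << start << ", " << i << "]\n";
        }
    }
}

int main() {
    findTopLevelBlocks("int f(int x) { return x + 1; }\n"
                       "int g() { { } return 0; }\n");
}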
I'd SWAG it would be a few days work using yacc & lex / flex & bison, and it might be so simple that there is no need for the parser generator.
If the code is Java, then ANTLR is a possible, and I think there was a simple Java parser example.
If Haskell is your focus, there may be student projects published which have made a reasonable stab at a parser.

Can replacing statements with expressions using the C++ comma operator allow more compiler optimizations?

The C++ comma operator is used to chain individual expressions, yielding the value of the last executed expression as the result.
For example the skeleton code (6 statements, 6 expressions):
step1;
step2;
if (condition)
{
    step3;
    return step4;
}
else
    return step5;
May be rewritten to: (1 statement, 6 expressions)
return step1,
step2,
condition?
step3, step4 :
step5;
I noticed that it is not possible to perform step-by-step debugging of such code, as the expression chain seems to be executed as a whole. Does it mean that the compiler is able to perform special optimizations which are not possible with the traditional statement approach (especially if the steps are const or inline)?
Note: I'm not talking about the coding style merit of that way of expressing sequence of expressions! Just about the possible optimisations allowed by replacing statements by expressions.
Most compilers will break your code down into "basic blocks", which are stretches of code with no jumps/branches in or out. Optimisations will be performed on a graph of these blocks: that graph captures all the control flow in the function. The basic blocks are equivalent in your two versions of the code, so I doubt that you'd get different optimisations.

That the basic blocks are the same isn't entirely obvious: it relies on the fact that the control flow between the steps is the same in both cases, and so are the sequence points. The most plausible difference is that you might find in the second case there is only one block including a "return", and in the first case there are two. The blocks are still equivalent, since the optimiser can replace two blocks that "do the same thing" with one block that is jumped to from two different places. That's a very common optimisation.
It's possible, of course, that a particular compiler doesn't ignore or eliminate the differences between your two functions when optimising. But there's really no way of saying whether any differences would make the result faster or slower, without examining what that compiler is doing. In short there's no difference between the possible optimisations, but it doesn't necessarily follow that there's no difference between the actual optimisations.
The reason you can't single-step your second version of the code is just down to how the debugger works, not the compiler. Single-step usually means, "run to the next statement", so if you break your code into multiple statements, you can more easily debug each one. Otherwise, if your debugger has an assembly view, then in the second case you could switch to that and single-step the assembly, allowing you to see how it progresses. Or if any of your steps involve function calls, then you may be able to "do the hokey-cokey", by repeatedly doing "step in, step out" of the functions, and separate them that way.
Using the comma operator neither promotes nor hinders optimization in any circumstances I'm aware of, because the C++ standard guarantee is only that evaluation will be in left-to-right order, not that statement execution necessarily will be. (This is the same guarantee you get with statement line order.)
What it is likely to do, though, is turn your code into a confusing mess, since many programmers are unaware that the comma-as-operator even exists, and are apt to confuse it with commas used as parameter separators. (Want to really make your code unreadable? Call a function like my_func((++i, y), x).)
The "best" use of the comma operator I've seen is to work with multiple variables in the iteration statement of a for loop:
for (int i = 0, j = 0;
i < 10 && j < 12;
i += j, ++j) // each time through the loop we're tinkering with BOTH i and j
{
}
Very unlikely IMHO. The thing gets compiled down to assembler/machine code, then further low-level optimizations are done, so it probably turns out to be the same thing.
OTOH, if the comma operator is overloaded, the game changes completely. But I'm sure you know that. ;)
The obligatory list:
Don't worry about rewriting almost equivalent code to gain performance
If you have a perf-problem, profile to see what the problem is
If you can't get it faster by algorithmic improvements, look at the disassembly and check that the compiler does what you intended
If not, ask here and post source and disassembly for both versions. :)