Is there a way to convert C++ ifs to switches? Both nested and non-nested ifs:
if ( a == 1 ) { b = 1; } /*else*/
if ( a == 2 ) { b = 7; } /*else*/
if ( a == 3 ) { b = 3; }
It should, of course, detect if the conversion is valid or not.
This isn't a micro-optimization, it's for clarity purposes.
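For the example above, the intended result would presumably be something like:
switch (a) {
    case 1: b = 1; break;
    case 2: b = 7; break;
    case 3: b = 3; break;
}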
In general, you ought to be able to do this with a program transformation tool that can process C++.
In specific, our DMS Software Reengineering Toolkit can do this, using source-to-source rewrites. I think these are close to doing the job:
default domain Cpp;
rule fold_if_to_switch_initial(s: stmt_sequence, e: expression,
k1: integer_literal, k2: integer_literal,
a1: block, a2: block):
stmt_sequence->stmt_sequence
= "\s
if (\e==\k1) \a1
if (\e==\k2) \a2 "
-> "\s
switch (\e) {
case \k1: \a1
break;
case \k2: \a2
break;
}" if no_side_effects(e) /\ no_impact_on(a1,e);
rule fold_if_to_switch_expand(s: stmt_sequence, e: expression, c: casebody,
k: integer_literal, a: action):
stmt_sequence->stmt_sequence
= "\s
switch (\e) { \c }
if (\e==\k) \a "
-> "\s
switch (\e) {
\c
case \k: \a
break;
}" if no_side_effects(e) /\ no_impact_on(c,e);
One has to handle the cases where the if controls a statement rather than a block, cheesily accomplished by this:
rule blockize_if_statements(e: expression, s: statement):
statement->statement
= "if (\e) \s" -> "if (\e) { \s } ";
DMS works by using a full language parser (yes, it really has a full C++ parser option with preprocessor)
to process the source code and the rewrite rules. It isn't fooled by whitespace or comments, because it is operating on the parse trees. The text inside "..." is C++ text
(that's what the 'default domain Cpp' declaration says) with pattern variables \x; text outside the "..." is DMS's rule meta syntax, not C++. Patterns match the trees and bind pattern variables to subtrees; rule right-hand sides are instantiated using the pattern variable bindings. After applying the transformations,
it regenerates the (revised) source code from the revised trees.
OP insists, "It should, of course, detect if the conversion is valid or not." This is the point of the conditions at the end of the rules (e.g., 'if no_side_effects ... no_impact_on'), which are intended to check that evaluating the expression doesn't change anything, and that the actions don't affect the value of the expression.
Implementing these semantic checks is difficult, as one must take into account C++'s incredibly complex semantics. A tool doing this must know at least as much as the compiler (e.g., names, types, properties of declarations, reads and writes), and then has to be able to reason about the consequences. DMS at present is able to implement only some of these (we are working on full control and data flow analysis for C++11 now). I would not expect other refactoring tools to do this part well either, because of the complex semantics; I think it unlikely that this situation will improve, given the effort it takes to reason about C++.
In OP's case (and this occurs fairly often when using a tool like DMS), one can assert that these semantic predicates are in fact never violated (then you can replace them with "true" and thus avoid implementing them), or that you only care about special cases (e.g., the action is an assignment to a variable other than one in the expression), at which point the check becomes much simpler. Or you can simulate the check by narrowing the acceptable syntax:
rule fold_if_to_switch_expand(s: stmt_sequence, v: identifier, c: casebody,
k: integer_literal, v2: identifier, k2: integer_literal):
stmt_sequence->stmt_sequence
= "\s
switch (\v) { \c }
if (\v==\k) \v2=\k2; "
-> "\s
switch (\v) {
\c
case \k: \v2=\k2;
break;
}" if no_side_effects(c);
Having said all this, if OP has only a few hundred known places that need this improvement, he'll probably get it done faster by biting the bullet and using his editor. If he doesn't know the locations, and/or there are many more instances, DMS would get the job done faster and likely with many fewer mistakes.
(We've done essentially this same refactoring for IBM Enterprise COBOL using DMS,
without the deep semantic checks).
There are other program transformation tools with the right kind of machinery (pattern matching, rewriting) that can do this in theory. None of them to my knowledge make any effective attempt to handle C++.
I'd use a text editor.
Step 1:
Put MARK before your block of code, with possible leading tabs
Step 2:
Put ENDMARK at the end of your block of code, with possible leading tabs
Step 3:
Load your file in vi of some kind. Run something like this:
/^^T*MARK/
:.,/^^T*ENDMARK/ s/^\(^T*\)if/\1IIFF/g
:%s/IIFF *([^)]*) *{ *\([^}]*\)};? *$/TEST{\1}TSET {\2} break;/
:%s/TEST{\([^=}]*\)==\([^}]*\)}TSET/COND{\1}case \2:/g
?^^T*MARK?
/COND{/
:. s/COND{\([^}]*\)}/switch(\1) {^M^T/
:%s/COND{\([^}]*\)}/^T/g
:%s/^\(^T*\)ENDMARK/\1} \/\/switch/g
:%s/^^TMARK//g
where ^T is tab and ^M is a return character (which you may not be able to type directly, depending on what version of vi/vim you are using)
Code is written off the top of my head, and not tested. But I've done things like this before. Also, files containing TEST{whatever}TSET and COND{whatever} and MARK and ENDMARK and IIFF type constructs will be horribly mangled.
This does not check validity, it just converts things in exactly your format into a switch statement.
Limited validity checking could be done with some fancy-ass (for vi) scripting to make sure that the left-hand side of the equality expression is the same for all lines. It also doesn't handle else, but that is easily rectified.
Then, before checking in, you do a visual sweep of the modified code using a good comparison tool.
Taken from Introduction to Ada—If expressions:
Ada's if expressions are similar to if statements. However, there are a few differences that stem from the fact that it is an expression:
All branches' expressions must be of the same type
It must be surrounded by parentheses if the surrounding expression does not already contain them
An else branch is mandatory unless the expression following then has a Boolean value. In that case an else branch is optional and, if not present, defaults to else True.
I do not understand the need to have two different ways of constructing code with the if keyword. What is the reasoning behind this?
There are also case expressions and case statements. Why is this?
I think this is best answered by quoting the Ada 2012 Rationale Chapter 3.1:
One of the key areas identified by the WG9 guidance document [1] as
needing attention was improving the ability to write and enforce
contracts. These were discussed in detail in the previous chapter.
When defining the new aspects for preconditions, postconditions, type
invariants and subtype predicates it became clear that without more
flexible forms of expressions, many functions would need to be
introduced because in all cases the aspect was given by an expression.
However, declaring a function and thus giving the detail of the
condition, invariant or predicate in the function body makes the
detail of the contract rather remote for the human reader. Information
hiding is usually a good thing but in this case, it just introduces
obscurity. Four forms are introduced, namely, if expressions, case
expressions, quantified expressions and expression functions. Together
they give Ada some of the flexible feel of a functional language.
In addition, if statements and case statements often do nothing but assign different values to the same variable in each branch:
if Foo > 10 then
   Bar := 1;
else
   Bar := 2;
end if;
In this case, an if expression may increase readability and more clearly state in the code what's going on:
Bar := (if Foo > 10 then 1 else 2);
We can now see that there's no longer a need for the maintainer of the code to read a whole if statement in order to see that only a single variable is updated.
Same goes for case expressions, which can also reduce the need for nesting if expressions.
Also, I can throw the question back to you: why do C-based languages have the ternary operator ?: in addition to if statements?
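In C++, for instance, the Ada if expression above maps directly onto the ternary operator:
// C++ rendering of `Bar := (if Foo > 10 then 1 else 2);`
int bar = (foo > 10) ? 1 : 2;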
Egilhh already covered the main reason, but there are sometimes other useful reasons to implement expressions. Sometimes you make packages where only one or two small functions are needed, and they would otherwise be the only reason to make a package body. You can use expression functions, which allow you to define the operations in the spec file.
Additionally, if you ever end up with some complex variant record combinations, sometimes expressions can be used to set up default values for them in instances where you normally would not be able to as cleanly. Consider the following example:
with Ada.Text_IO; use Ada.Text_IO;

procedure Hello is

   type Binary_Type is (On, Off);

   type Inner (Binary : Binary_Type := Off) is record
      case Binary is
         when On =>
            Value : Integer := 0;
         when Off =>
            null;
      end case;
   end record;

   type Outer (Some_Flag : Boolean) is record
      Other : Integer := 32;
      Thing : Inner := (if Some_Flag then
                           (Binary => Off)
                        else
                           (Binary => On, Value => 23));
   end record;

begin
   Put_Line("Hello, world!");
end Hello;
I had something come up with a more complex setup that was meant to map to a complex messaging interface at the hardware level. It's nice to have defaults whenever possible. Now I could have used a case inside of Outer, but then I would have had to come up with two separately named versions of the message field for each case, which really isn't optimal when you want your code to map to an ICD. Again, I could have used a function to initialize it as well, but as noted in the other poster's answer, that isn't always a good way to go.
Another place that outlines the motivation for adding conditional expressions to Ada can be found in the ARG document, AI05-0147-1, which explains the motivation and gives some examples of use.
An example of a place where I find them quite useful is in processing command line parameters, for the case when a default value is used if the parameter is not specified on the command line. Generally, you'd want to declare such values as constants in your program. Conditional expressions make it easier to do that.
with Ada.Command_Line; use Ada;

procedure Main
is
   N : constant Positive :=
     (if Command_Line.Argument_Count = 0 then 2_000_000
      else Positive'Value (Command_Line.Argument (1)));
   ...
Otherwise, without conditional expressions, in order to achieve the same effect you'd need to declare a function, which I find to be more difficult to read:
with Ada.Command_Line; use Ada;

procedure Main
is
   function Get_N return Positive is
   begin
      if Command_Line.Argument_Count = 0 then
         return 2_000_000;
      else
         return Positive'Value (Command_Line.Argument (1));
      end if;
   end Get_N;

   N : constant Positive := Get_N;
   ...
The if expression in Ada feels and works a lot like a statement using the ternary operator in the C-based languages. I took the liberty of copying some code from learn.adacore.com that introduces the if expression:
with Ada.Text_IO;         use Ada.Text_IO;
with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;

procedure Check_Positive is
   N : Integer;
begin
   Put ("Enter an integer value: ");
   Get (N);
   Put (N, 0);
   declare
      S : constant String :=
        (if N > 0 then " is a positive number"
         else " is not a positive number");
   begin
      Put_Line (S);
   end;
end Check_Positive;
And I translated it to a C-based language, in this case Java. I believe the main point to notice is that both languages, although syntactically different, are effectively doing the same thing: testing a condition and assigning one of two values to a variable, all within one statement. Although I realize this is an oversimplification for most here on Stack Overflow, my goal is to help the beginner understand the basic concept with introductory examples. Cheers.
import java.util.Scanner;

public class IfExpression {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        System.out.print("Enter an integer value: ");
        var N = in.nextInt();
        System.out.print(N);
        var S = N > 0 ? " is a positive number" : " is not a positive number";
        System.out.println(S);
        in.close();
    }
}
I'm trying to write a simple interpreted programming language in C++. I've read that a lot of people use tools such as Lex/Flex and Bison to avoid "reinventing the wheel", but since my goal is to understand how these little beasts work and improve my knowledge, I've decided to write the lexer and the parser from scratch. At the moment I'm working on the parser (the lexer is complete) and I was asking myself what its output should be. A tree? A linear vector of statements with a "depth" or "shift" parameter? How should I manage loops and if statements? Should I replace them with invisible goto statements?
A parser should almost always output an AST. An AST is simply, in the broadest sense, a tree representation of the syntactical structure of the program. A Function becomes an AST node containing the AST of the function body. An if becomes an AST node containing the AST of the condition and the body. A use of an operator becomes an AST node containing the AST of each operand. Integer literals, variable names, and so on become leaf AST nodes. Operator precedence and such is implicit in the relationship of the nodes: Both 1 * 2 + 3 and (1 * 2) + 3 are represented as Add(Mul(Int(1), Int(2)), Int(3)).
Many details of what's in the AST depend on your language (obviously) and what you want to do with the tree. If you want to analyze and transform the program (i.e., emit altered source code at the end), you might preserve comments. If you want detailed error messages, you might add source locations (as in: this integer literal was on line 5, column 12).
A compiler will proceed to turn the AST into a different format (e.g. a linear IR with gotos, or data flow graphs). Going through the AST is still a good idea, because a well-designed AST has a good balance of being syntax-oriented but only storing what's important for understanding the program. The parser can focus on parsing while the later transformations are protected from irrelevant details such as the amount of white space and operator precedence. Note that such a "compiler" might also output bytecode that's later interpreted (the reference implementation of Python does this).
A relatively pure interpreter might instead interpret the AST. Much has been written about this; it is about the easiest way to execute the parser's output. This strategy benefits from the AST in much the same way as a compiler; in particular most interpretation is simply top-down traversal of the AST.
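To make this concrete, here is a minimal sketch of such an AST and a tree-walking evaluator in C++, for the 1 * 2 + 3 example above (the node names follow the Add/Mul/Int notation used earlier; everything else is illustrative, not prescriptive):

#include <iostream>
#include <memory>

// A tiny expression AST: integer literals are leaves, Add/Mul are interior nodes.
struct Expr {
    virtual ~Expr() = default;
    virtual int eval() const = 0; // tree-walking interpretation
};

struct Int : Expr {
    int value;
    explicit Int(int v) : value(v) {}
    int eval() const override { return value; }
};

struct Add : Expr {
    std::unique_ptr<Expr> lhs, rhs;
    Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    int eval() const override { return lhs->eval() + rhs->eval(); }
};

struct Mul : Expr {
    std::unique_ptr<Expr> lhs, rhs;
    Mul(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    int eval() const override { return lhs->eval() * rhs->eval(); }
};

int main() {
    // Add(Mul(Int(1), Int(2)), Int(3)) -- precedence is implicit in the tree shape.
    Add root(std::make_unique<Mul>(std::make_unique<Int>(1),
                                   std::make_unique<Int>(2)),
             std::make_unique<Int>(3));
    std::cout << root.eval() << "\n"; // prints 5
}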
The formal and most properly correct answer is going to be that you should return an Abstract Syntax Tree. But that is simultaneously the tip of an iceberg and no answer at all.
An AST is simply a structure of nodes describing the parse; a visualization of the paths your parse took thru the token/state machine.
Each node represents a path or description. For example, you would have nodes which represent language statements, nodes which represent compiler directives, and nodes which represent data.
Consider a node which describes a variable, and let's say your language supports variables of type int and string and the notion of "const". You may well choose to make the type a direct property of the Variable node struct/class, but typically in an AST you make properties - like constness - a "mutator", which is itself some form of node linked to the Variable node.
You could implement the C++ concept of "scope" by having locally-scoped variables as mutations of a BlockStatement node, and the constraints of a "Loop" node (for, do, while, etc.) as mutators.
When you closely tie your parser/tokenizer to your language implementation, it can become a nightmare making even small changes.
While this is true, if you actually want to understand how these things work, it is worth going through at least one first implementation where you begin to implement your runtime system (vm, interpreter, etc) and have your parser target it directly. (The alternative is, e.g., to buy a copy of the "Dragon Book" and read how it's supposed to be done, but it sounds like you are actually wanting to have the full understanding that comes from having worked thru the problem yourself).
The trouble with being told to return an AST is that an AST actually needs a form of parsing.
struct Node
{
    enum class Type {
        Variable,
        Condition,
        Statement,
        Mutator,
    };

    Node*  m_parent;
    Node*  m_next;
    Node*  m_child;
    Type   m_type;
    string m_file;
    size_t m_lineNo;
};
struct VariableMutatorNode : public Node
{
    enum class Mutation {
        Const
    };

    Mutation m_mutation;
    // ...
};
struct VariableNode : public Node
{
    VariableMutatorNode* m_mutators;
    // ...
};
Node* ast; // Top level node in the AST.
This sort of AST is probably OK for a compiler that is independent of its runtime, but you'd need to tighten it up a lot for a complex, performance-sensitive language down the road (at which point there is less 'A' in 'AST').
The way you walk this tree is to start with the first node of 'ast' and act according to it. If you're writing in C++, you can do this by attaching behaviors to each node type. But again, that's not so "abstract", is it?
Alternatively, you have to write something which works its way thru the tree.
switch (node->m_type) {
case Node::Type::Variable:
    declareVariable(node);
    break;
case Node::Type::Condition:
    evaluate(node);
    break;
case Node::Type::Statement:
    execute(node);
    break;
}
And as you write this, you'll find yourself thinking "wait, why didn't the parser do this for me?" because processing an AST often feels a lot like you did a crap job of implementing the AST :)
There are times when you can skip the AST and go straight to some form of final representation, and (rare) times when that is desirable; then there are times when you could go straight to some form of final representation but now you have to change the language and that decision will cost you a lot of reimplementation and headaches.
This is also generally the meat of building your compiler - the lexer and parser are generally the lesser parts of such an undertaking. Working with the abstract/post-parse representation is a much more significant part of the work.
That's why people often go straight to flex/bison or antlr or some such.
And if that's what you want to do, looking at .NET or LLVM/Clang can be a good option, but you can also fairly easily bootstrap yourself with something like this: http://gnuu.org/2009/09/18/writing-your-own-toy-compiler/4/
Best of luck :)
I would build a tree of statements. After that, yes, goto statements are how the majority of it works (jumps and calls). Are you translating to a low level like assembly?
The output of the parser should be an abstract syntax tree, unless you know enough about writing compilers to directly produce byte-code, if that's your target language. It can be done in one pass but you need to know what you're doing. The AST expresses loops and ifs directly: you're not concerned with translating them yet. That comes under code generation.
People don't use lex/yacc to avoid re-inventing the wheel; they use it to build a more robust compiler prototype more quickly, with less effort, to focus on the language, and to avoid getting bogged down in other details. From personal experience with several VM projects, compilers and assemblers, I suggest if you want to learn how to build a language, do just that -- focus on building a language (first).
Don't get distracted with:
Writing your own VM or runtime
Writing your own parser generator
Writing your own intermediate language or assembler
You can do these later.
This is a common thing I see when a bright young computer scientist first catches the "language fever" (and it's a good thing to catch), but you need to be careful and focus your energy on the one thing you want to do well, and make use of other robust, mature technologies like parser generators, lexers, and runtime platforms. You can always circle back later, when you have slain the compiler dragon first.
Just spend your energy learning how a LALR grammar works and write your language grammar in Bison, or Yacc++ if you can still find it; don't get distracted by people who say you should be using ANTLR or whatever else, because that isn't the goal early on. Early on, you need to focus on crafting your language: removing ambiguities, creating a proper AST (maybe the most important skill set), semantic checking, symbol resolution, type resolution, type inference, implicit casting, tree rewriting, and of course, end program generation. There is enough to be done making a proper language that you don't need to be learning multiple other areas of research that some people spend their whole careers mastering.
I recommend you target an existing runtime like the CLR (.NET). It is one of the best runtimes for crafting a hobby language. Get your project off the ground using a textual output to IL, and assemble with ilasm. ilasm is relatively easy to debug, assuming you put some time into learning it. Once you get a prototype going, you can then start thinking about other things like an alternate output to your own interpreter, in case you have language features that are too dynamic for the CLR (then look at the DLR). The main point here is that CLR provides a good intermediate representation to output to. Don't listen to anyone that tells you you should be directly outputting bytecode. Text is king for learning in the early stages and allows you to plug and play with different languages / tools. A good book is by the author John Gough, titled Compiling for the .NET Common Language Runtime (CLR) and he takes you through the implementation of the Gardens Point Pascal Compiler, but it isn't a book about Pascal, it is a book about how to build a real compiler on the CLR. It will answer many of your questions on implementing loops and other high level constructs.
Related to this, a great tool for learning is to use Visual Studio and ildasm (the disassembler) and .NET Reflector. All available for free. You can write small code samples, compile them, then disassemble them to see how they map to a stack based IL.
If you aren't interested in the CLR for whatever reason, there are other options out there. You will probably run across llvm, Mono, NekoVM, and Parrot (all good things to learn) in your searches. I was an original Parrot VM / Perl 6 developer, and wrote the Perl Intermediate Representation language and imcc compiler (which is quite a terrible piece of code I might add) and the first prototype Perl 6 compiler. I suggest you stay away from Parrot and stick with something easier like .NET CLR, you'll get much further. If, however, you want to build a real dynamic language, and want to use Parrot for its continuations and other dynamic features, see the O'Reilly Books Perl and Parrot Essentials (there are several editions), the chapters on PIR/IMCC are about my stuff, and are useful. If your language isn't dynamic, then stay far away from Parrot.
If you are bent on writing your own VM, let me suggest you prototype the VM in Perl, Python or Ruby. I have done this a couple of times with success. It allows you to avoid too much implementation early, until your language starts to mature. Perl+Regex are easy to tweak. An intermediate language assembler in Perl or Python takes a few days to write. Later, you can rewrite the 2nd version in C++ if you still feel like it.
All this I can sum up with: avoid premature optimizations, and avoid trying to do everything at once.
First you need to get a good book. I refer you to the book by John Gough in my other answer, but I emphasize: focus on learning to implement an AST for a single, existing platform first. It will help you learn about AST implementation.
How to implement a loop?
Your language parser should return a tree node during the reduce step for the WHILE statement. You might name your AST class WhileStatement; it has, as members, a ConditionExpression, a BlockStatement, and several labels (these could also be inherited, but I added them inline for clarity).
The grammar pseudocode below shows how the reduce step creates a new WhileStatement object in a typical shift-reduce parser reduction.
How does a shift-reduce parser work?
WhileStatement
    : WHILE ( ConditionExpression ) BlockStatement
      {
          $$ = new WhileStatement($3, $5);
          statementList.Add($$); // this is your statement list (AST nodes), not the parse stack
      }
    ;
As your parser sees "WHILE", it shifts the token onto the stack. And so forth.
parseStack.push(WHILE);
parseStack.push('(');
parseStack.push(ConditionalExpression);
parseStack.push(')');
parseStack.push(BlockStatement);
The instance of WhileStatement is a node in a linear statement list. So behind the scenes, the "$$ =" represents a parse reduce (though if you want to be pedantic, $$ = ... is user-code, and the parser is doing its own reductions implicitly, regardless). The reduce can be thought of as popping off the tokens on the right side of the production, and replacing with the single token on the left side, reducing the stack:
// shift-reduce
parseStack.pop_n(5); // pop off the top 5 tokens ($1 = WHILE, $2 = (, $3 = ConditionExpression, etc.)
parseStack.push(currToken); // replace with the current $$ token
You still need to add your own code to add statements to a linked list, with something like "statements.add(whileStatement)" so you can traverse this later. The parser has no such data structure, and its stacks are only transient.
During parse, synthesize a WhileStatement instance with its appropriate members.
In a later phase, implement the visitor pattern to visit each statement, resolve symbols, and generate code. So a while loop might be implemented with the following AST C++ class:
class WhileStatement : public CompoundStatement {
public:
    ConditionExpression * condExpression; // this is the conditional check
    Label * startLabel;                   // Label can simply be a Symbol
    Label * redoLabel;
    Label * endLabel;
    BlockStatement * loopStatement;       // this is the loop code

    bool ResolveSymbolsAndTypes();
    bool SemanticCheck();
    bool Emit(); // emit code
};
Your code generator needs to have a function that generates sequential labels for your assembler. A simple implementation is a function to return a string with a static int that increments, and returns LBL1, LBL2, LBL3, etc. Your labels can be symbols, or you might get fancy with a Label class, and use a constructor for new Labels:
class Label : public Symbol {
public:
    Label() {
        this->name = newLabel(); // incrementing LBL1, LBL2, LBL3
    }
};
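A minimal sketch of that generator (the name newLabel comes from the snippet above; the body is an assumption):

#include <string>

// Hypothetical label generator: a static counter yields "LBL1", "LBL2",
// "LBL3", ... on successive calls, as described above.
std::string newLabel() {
    static int counter = 0;
    return "LBL" + std::to_string(++counter);
}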
A loop is implemented by emitting the redoLabel, then the code for condExpression (branching to endLabel when it is false), then the blockStatement, and at the end of the blockStatement a goto back to redoLabel.
A sample from one of my compilers to generate code for the CLR.
// Generate code for .NET CLR for While statement
//
void WhileStatement::clr_emit(AST *ctx)
{
redoLabel = compiler->mkLabelSym();
startLabel = compiler->mkLabelSym();
endLabel = compiler->mkLabelSym();
// Emit the redo label which is the beginning of each loop
compiler->out("%s:\n", redoLabel->getName());
if(condExpr) {
condExpr->clr_emit_handle();
condExpr->clr_emit_fetch(this, t_bool);
// Test the condition, if false, branch to endLabel, else fall through
compiler->out("brfalse %s\n", endLabel->getName());
}
// The body of the loop
compiler->out("%s:\n", startLabel->getName()); // start label only for clarity
loopStmt->clr_emit(this); // generate code for the block
// End label, jump out of loop
compiler->out("br %s\n", redoLabel->getName()); // goto redoLabel
compiler->out("%s:\n", endLabel->getName()); // endLabel for goto out of loop
}
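For a loop while (cond) { body }, the routine above emits IL text shaped roughly like this (an illustrative sketch of the output, not captured from a real run):

LBL1:                 // redoLabel: top of each iteration
  ...condition...     // condExpr code leaves a bool on the stack
  brfalse LBL3        // condition false: branch out of the loop
LBL2:                 // startLabel (emitted only for clarity)
  ...loop body...
  br LBL1             // unconditional branch back to redoLabel
LBL3:                 // endLabel: first instruction after the loop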
This answer explains that to validate an arbitrary regular expression, one simply uses eval:
while (<>) {
    eval "qr/$_/";
    print $@ ? "Not a valid regex: $@\n" : "That regex looks valid\n";
}
However, this strikes me as very unsafe, for what I hope are obvious reasons. Someone could input, say:
foo/; system('rm -rf /'); qr/
or whatever devious scheme they can devise.
The natural way to prevent such things is to escape special characters, but if I escape too many characters, I severely limit the usefulness of the regex in the first place. A strong argument can be made, I believe, that at least []{}()/-,.*?^$! and white space characters ought to be permitted (and probably others), un-escaped, in a user regex interface, for the regexes to have minimal usefulness.
Is it possible to secure myself from regex injection, without limiting the usefulness of the regex language?
The solution is simply to change
eval("qr/$_/")
to
eval("qr/\$_/")
This can be written more clearly as follows:
eval('qr/$_/')
But that's still not optimal. The following would be far better as it doesn't involve generating and compiling Perl code at run-time:
eval { qr/$_/ }
Note that neither solution protects you from denial-of-service attacks. It's quite easy to write a pattern that will take longer than the life of the universe to complete. To handle that situation, you could execute the regex match in a child process for which a CPU ulimit has been set.
There is some discussion about this over at The Monastery.
TLDR: use re::engine::RE2 (-strict => 1);
Make sure to add (-strict => 1) to your use statement or re::engine::RE2 will fall back to Perl's re.
The following is a quote from junyer, owner of the project on github.
RE2 was designed and implemented with an explicit goal of being able to handle regular expressions from untrusted users without risk. One of its primary guarantees is that the match time is linear in the length of the input string. It was also written with production concerns in mind: the parser, the compiler and the execution engines limit their memory usage by working within a configurable budget – failing gracefully when exhausted – and they avoid stack overflow by eschewing recursion.
I'm in the process of refactoring a very large amount of code, mostly C++, to remove a number of temporary configuration checks which have become permanently set to given values. So for example, I would have the following code:
#include <value1.h>
#include <value2.h>
#include <value3.h>
...
if ( value1() )
{
    // do something
}

bool b = value2();
if ( b && anotherCondition )
{
    // do more stuff
}

if ( value3() < 10 )
{
    // more stuff again
}
where the calls to value return either a bool or an int. Since I know the values that these calls always return, I've done some regex substitution to expand the calls to their normal values:
// where:
//   value1() == true
//   value2() == false
//   value3() == 4

// TODO: Remove expanded config (value1)
if ( true )
{
    // do something
}

// TODO: Remove expanded config (value2)
bool b = false;
if ( b && anotherCondition )
{
    // do more stuff
}

// TODO: Remove expanded config (value3)
if ( 4 < 10 )
{
    // more stuff again
}
Note that although the values are fixed, they are not set at compile time but are read from shared memory so the compiler is not currently optimising anything away behind the scenes.
Although the resultant code looks a bit goofy, this regex approach achieves a lot of what I want, since it's simple to apply and removes dependence on the calls while not changing the behaviour of the code. It's also likely that the compiler will then optimise a lot of it away, knowing that a block can never be reached or that a check will always return true. It also makes it reasonably easy (especially when diffing against version control) to see what has changed and to take the final step of cleaning it up, so that the code above eventually looks as follows:
// do something
// DONT do more stuff (b being false always prevented this)
// more stuff again
The trouble is that I have hundreds (possibly thousands) of changes to make to get from the second, correct-but-goofy stage to the final cleaned code.
I wondered if anyone knew of a refactoring tool which might handle this or of any techniques I could apply. The main problem is that the C++ syntax makes full expansion or elimination quite difficult to achieve and there are many permutations to the code above. I feel I almost need a compiler to deal with the variation of syntax that I would need to cover.
I know there have been similar questions but I can't find any requirement quite like this and also wondered if any tools or procedures had emerged since they were asked?
It sounds like you have what I call "zombie code"... dead in practice, but still live as far as the compiler is concerned. This is a pretty common issue with most systems of organized runtime configuration variables: eventually some configuration variables arrive at a permanent fixed state, yet are reevaluated at runtime repeatedly.
The cure isn't regex, as you have noted, because regex doesn't parse C++ code reliably.
What you need is a program transformation system. This is a tool that really parses source code, and can apply a set of code-to-code rewriting rules to the parse tree, and can regenerate source text from the changed tree.
I understand that Clang has some capability here; it can parse C++ and build a tree, but it does not have source-to-source transformation capability. You can simulate that capability by writing AST-to-AST transformations but that's a lot more inconvenient IMHO. I believe it can regenerate C++ code but I don't know if it will preserve comments or preprocessor directives.
Our DMS Software Reengineering Toolkit with its C++(11) front end can (and has been used to) carry out massive transformations on C++ source code, and has source-to-source transformations. AFAIK, it is the only production tool that can do this. What you need is a set of transformations that represent your knowledge of the final state of the configuration variables of interest, and some straightforward code simplification rules. The following DMS rules are close to what you likely want:
rule fix_value1():expression->expression
"value1()" -> "true";
rule fix_value2():expression->expression
"value2()" -> "false";
rule fix_value3():expression->expression
"value3()" -> "4";
rule simplify_boolean_and_true(r:relation):condition->condition
"\r && true" -> "\r";
rule simplify_boolean_or_true(r:relation):condition->condition
"\r || true" -> "true";
rule simplify_boolean_and_false(r:relation):condition->condition
"\r && false" -> "false";
...
rule simplify_boolean_not_true():condition->condition
"!true" -> "false";
...
rule simplify_if_then_false(s:statement): statement->statement
" if (false) \s" -> ";";
rule simplify_if_then_true(s:statement): statement->statement
" if (true) \s" -> "\s";
rule simplify_if_then_else_false(s1:statement, s2:statement): statement->statement
" if (false) \s1 else \s2" -> "\s2";
rule simplify_if_then_else_true(s1:statement, s2: statement): statement->statement
" if (true) \s1 else \s2" -> "\s1";
You also need rules to simplify ("fold") constant expressions involving arithmetic, and rules to handle switch on expressions that are now constant. To see what DMS rules look like for integer constant folding see Algebra as a DMS domain.
Unlike regexes, DMS rewrite rules cannot "mismatch" code; they represent the corresponding ASTs, and it is the ASTs that are matched. Because it is AST matching, they have no problems with whitespace, line breaks or comments. You might think they could have trouble with order of operands ('what if "false && x" is encountered?'); they do not, as the grammar rules for && and || are marked in the DMS C++ parser as associative and commutative, and the matching process automatically takes that into account.
What these rules cannot do by themselves is value (in your case, constant) propagation across assignments. For this you need flow analysis so that you can trace such assignments ("reaching definitions"). Obviously, if you don't have such assignments or have very few, you can hand-patch those. If you do, you'll need the flow analysis; alas, DMS's C++ front end isn't quite there, but we are working on it; we have control flow analysis in place. (DMS's C front end has full flow analysis.)
(EDIT February 2015: Now does full C++14; flow analysis within functions/methods).
We actually applied this technique to 1.5M SLOC application of mixed C and C++ code from IBM Tivoli almost a decade ago with excellent success; we didn't need the flow analysis :-}
You say:
Note that although the values are reasonably fixed, they are not set at compile time but are read from shared memory so the compiler is not currently optimising anything away behind the scenes.
Constant-folding the values by hand doesn't make a lot of sense unless they are completely fixed. If your compiler provides constexpr you could use that, or you could substitute in preprocessor macros like this:
#define value1() true
#define value2() false
#define value3() 4
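Or, if constexpr is available, a sketch along the same lines (assuming the values really are permanent now):

// Hypothetical constexpr replacements; the compiler can fold these at compile time.
constexpr bool value1() { return true; }
constexpr bool value2() { return false; }
constexpr int  value3() { return 4; }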
The optimizer would take care of you from there. Without seeing examples of exactly what's in your <valueX.h> headers or knowing how your process of getting these values from shared memory is working, I'll just throw out that it could be useful to rename the existing valueX() functions and do a runtime check in case they change again in the future:
// call this at startup to make sure our agreed-on values haven't changed
void check_values() {
    assert(value1() == get_value1_from_shared_memory());
    assert(value2() == get_value2_from_shared_memory());
    assert(value3() == get_value3_from_shared_memory());
}
In the Stack Overflow podcast #36 (https://blog.stackoverflow.com/2009/01/podcast-36/), this opinion was expressed:
Once you understand how easy it is to set up a state machine, you’ll never try to use a regular expression inappropriately ever again.
I've done a bunch of searching. I've found some academic papers and other complicated examples, but I'd like to find a simple example that would help me understand this process. I use a lot of regular expressions, and I'd like to make sure I never use one "inappropriately" ever again.
A rather convenient way to look at this is to use Python's little-known re.DEBUG flag on any pattern:
>>> re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG)
literal 60
subpattern 1
  in
    range (65, 90)
  max_repeat 0 65535
    in
      range (65, 90)
      range (48, 57)
at at_boundary
max_repeat 0 65535
  not_literal 62
literal 62
subpattern 2
  min_repeat 0 65535
    any None
literal 60
literal 47
groupref 1
literal 62
The numbers after 'literal' and 'range' refer to the integer values of the ASCII characters they're supposed to match; for example, literal 60 is '<' and range (65, 90) is A-Z.
Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:
^[A-Za-z][A-Za-z0-9_]*$
which is a typical identifier (it must start with an alphabetic character and can have any number of alphanumeric and underscore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine:
state = FIRSTCHAR
for char in all_chars_in(string):
    if state == FIRSTCHAR:
        if char is not in the set "A-Z" or "a-z":
            error "Invalid first character"
        state = SUBSEQUENTCHARS
        next char
    if state == SUBSEQUENTCHARS:
        if char is not in the set "A-Z" or "a-z" or "0-9" or "_":
            error "Invalid subsequent character"
        state = SUBSEQUENTCHARS
        next char
Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.
That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.
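To make the pseudo-code above concrete, here is a runnable C++ rendering of that same identifier machine (a sketch; the function and state names are mine):

#include <cctype>
#include <iostream>
#include <string>

// States for matching ^[A-Za-z][A-Za-z0-9_]*$, as in the pseudo-code above.
enum class State { FirstChar, SubsequentChars };

bool is_identifier(const std::string& s) {
    State state = State::FirstChar;
    for (unsigned char c : s) {
        switch (state) {
        case State::FirstChar:
            if (!std::isalpha(c))
                return false;               // invalid first character
            state = State::SubsequentChars;
            break;
        case State::SubsequentChars:
            if (!std::isalnum(c) && c != '_')
                return false;               // invalid subsequent character
            break;                          // stay in the same state
        }
    }
    return state == State::SubsequentChars; // rejects the empty string
}

int main() {
    std::cout << is_identifier("foo_42") << " "   // 1
              << is_identifier("9lives") << "\n"; // 0
}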
The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.
Make your own on the fly!
http://osteele.com/tools/reanimator/
This is a really nicely put together tool which visualises regular expressions as FSMs. It doesn't have support for some of the syntax you'll find in real-world regular expression engines, but certainly enough to understand exactly what's going on.
Is the question "How do I choose the states and the transition conditions?", or "How do I implement my abstract state machine in Foo?"
How do I choose the states and the transition conditions?
I usually use FSMs for fairly simple problems and choose them intuitively. In my answer to another question about regular expressions, I just looked at the parsing problem as one of being either Inside or outside a tag pair, and wrote out the transitions from there (with a beginning and ending state to keep the implementation clean).
How do I implement my abstract state machine in Foo?
If your implementation language supports a structure like C's switch statement, then you switch on the current state and process the input to see which action and/or transition to perform next.
Without switch-like structures, or if they are deficient in some way, you can use if-style branching. Ugh.
Written all in one place in C-like pseudo-code, the example I linked would look something like this:
token_t token;
state_t state = BEGIN_STATE;
do {
    switch (state.getValue()) {
    case BEGIN_STATE:
        state = OUT_STATE;
        break;
    case OUT_STATE:
        switch (token.getValue()) {
        case CODE_TOKEN:
            state = IN_STATE;
            output(token.string());
            break;
        case NEWLINE_TOKEN:
            output("<break>");
            output(token.string());
            break;
        ...
        }
        break;
    ...
    }
} while (state != END_STATE);
which is pretty messy, so I usually rip the state cases out to separate functions.
I'm sure someone has better examples, but you could check this post by Phil Haack, which has an example of a regular expression and a state machine doing the same thing (there's a previous post with a few more regex examples in there as well, I think).
Check the "HenriFormatter" on that page.
I don't know what academic papers you've already read, but it really isn't that difficult to understand how to implement a finite state machine. There is some interesting mathematics, but the idea is actually very trivial to understand. The easiest way to understand an FSM is through input and output (actually, this comprises most of the formal definition, which I won't describe here). A "state" essentially describes a set of inputs and outputs that have occurred and can occur from a certain point.
Finite state machines are easiest to understand via diagrams. For example:
[Figure: finite state machine diagram, originally at http://img6.imageshack.us/img6/7571/mathfinitestatemachinedco3.gif]
All this is saying is that if you begin in some state q0 (the one with the Start symbol next to it) you can go to other states. Each state is a circle. Each arrow represents an input or output (depending on how you look at it). Another way to think of a finite state machine is in terms of "valid" or "acceptable" input. There are certain output strings that are NOT possible for certain finite state machines; this would allow you to "match" expressions.
Now suppose you start at q0. Now, if you input a 0 you will go to state q1. However, if you input a 1 you will go to state q2. You can see this by the symbols above the input/output arrows.
Let's say you start at q0 and get this input
0, 1, 0, 1, 1, 1
This means you have gone through states (no input for q0, you just start there):
q0 -> q1 -> q0 -> q1 -> q0 -> q2 -> q3 -> q3
Trace the picture with your finger if it doesn't make sense. Notice that q3 goes back to itself for both inputs 0 and 1.
Another way to say all this is "If you are in state q0 and you see a 0, go to q1, but if you see a 1, go to q2." If you make these conditions for each state, you are nearly done defining your state machine. All you have to do is have a state variable and a way to pump input in, and that is basically all there is.
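As a sketch in C++, that state variable plus input pump might look like this (the q0 transitions and the q3 self-loops come from the description above; the q1 and q2 rows are hypothetical, since the original diagram is unavailable):

#include <cstddef>
#include <iostream>

enum State { Q0, Q1, Q2, Q3 };

// next_state[state][input] for inputs 0 and 1
const State next_state[4][2] = {
    /* q0 */ {Q1, Q2},
    /* q1 */ {Q0, Q0}, // hypothetical
    /* q2 */ {Q3, Q3}, // hypothetical
    /* q3 */ {Q3, Q3}, // q3 goes back to itself on both 0 and 1
};

State run(const int* input, std::size_t n) {
    State s = Q0;                    // the state variable
    for (std::size_t i = 0; i < n; ++i)
        s = next_state[s][input[i]]; // pump the input in
    return s;
}

int main() {
    const int input[] = {0, 1, 0, 1, 1, 1};
    std::cout << "final state: q" << run(input, 6) << "\n";
}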
OK, so why is this important regarding Joel's statement? Well, building the "ONE TRUE REGULAR EXPRESSION TO RULE THEM ALL" can be very difficult, and also difficult to maintain, modify, or even for others to come back and understand. Also, in some cases a state machine is more efficient.
Of course, state machines have many other uses. Hope this helps in some small way. Note, I didn't bother going into the theory but there are some interesting proofs regarding FSMs.