complexity of parsing C++ - c++

Out of curiosity, I was wondering what were some "theoretical" results about parsing C++.
Let n be the size of my project (in LOC, for example, but since we'll deal with big-O it's not very important)
Is C++ parsed in O(n) ? If not, what's the complexity?
Is C (or Java or any simpler language in the sense of its grammar) parsed in O(n)?
Will C++1x introduce new features that will be even harder to parse?
References would be greatly appreciated!

I think the term "parsing" is being interpreted by different people in different ways for the purposes of the question.
In a narrow technical sense, parsing is merely verifying the the source code matches the grammar (or perhaps even building a tree).
There's a rather widespread folk theorem that says you cannot parse C++ (in this sense) at all because you must resolve the meaning of certain symbols to parse. That folk theorem is simply wrong.
It arises from the use of "weak" (LALR or backtracking recursive descent) parsers, which, if they commit to the wrong choice of several possible subparse of a locally ambiguous part of text (this SO thread discusses an example), fail completely by virtue of sometimes making that choice. The way those that use such parser resolve the dilemma is collect symbol table data as parsing occurs and mash extra checking into the parsing process to force the parser to make the right choice when such choice is encountered. This works at the cost of significantly tangling name and type resolution with parsing, which makes building such parsers really hard. But, at least for legacy GCC, they used LALR which is linear time on parsing and I don't think that much more expensive if you include the name/type capture that the parser does (there's more to do after parsing because I don't think they do it all).
There are at least two implementations of C++ parsers done using "pure" GLR parsing technology, which simply admits that the parse may be locally ambiguous and captures the multiple parses without comment or significant overhead. GLR parsers are linear time where there are no local ambiguities. They are more expensive in the ambiguity region, but as a practical matter, most the of source text in a standard C++ program falls into the "linear time" part. So the effective rate is linear, even capturing the ambiguities. Both of the implemented parsers do name and type resolution after parsing and use inconsistencies to eliminate the incorrect ambiguous parses.
(The two implementations are Elsa and our (SD's) C++ Front End. I can't speak for Elsa's current capability (I don't think it has been updated in years), but ours does all of C++11 [EDIT Jan 2015: now full C++14 EDIT Oct 2017: now full C++17] including GCC and Microsoft variants).
If you take the hard computer science definition that a language is extensionally defined as an arbitrary set of strings (Grammars are supposed to be succinct ways to encode that intensionally) and treating parsing as "check the the syntax of the program is correct" then for C++ you have expand the templates to verify that each actually expands completely. There's a Turing machine hiding in the templates, so in theory checking that a C++ program is valid is impossible (no time limits). Real compilers (honoring the standard) place fixed constraints on how much template unfolding they'll do, and so does real memory, so in practice C++ compilers finish. But they can take arbitrarily long to "parse" a program in this sense. And I think that's the answer most people care about.
As a practical matter, I'd guess most templates are actually pretty simple, so C++ compilers can finish as fast as other compilers on average. Only people crazy enough to write Turing machines in templates pay a serious price. (Opinion: the price is really the conceptual cost of shoehorning complicated things onto templates, not the compiler execution cost.)

Depends what you mean by "parsed", but if your parsing is supposed to include template instantiation, then not in general:
[Shortcut if you want to avoid reading the example - templates provide a rich enough computational model that instantiating them is, in general, a halting-style problem]
template<int N>
struct Triangle {
static const int value = Triangle<N-1>::value + N;
};
template<>
struct Triangle<0> {
static const int value = 0;
};
int main() {
return Triangle<127>::value;
}
Obviously, in this case the compiler could theoretically spot that triangle numbers have a simple generator function, and calculate the return value using that. But otherwise, instantiating Triangle<k> is going to take O(k) time, and clearly k can go up pretty quickly with the size of this program, as far as the limit of the int type.
[End of shortcut]
Now, in order to know whether Triangle<127>::value is an object or a type, the compiler in effect must instantiate Triangle<127> (again, maybe in this case it could take a shortcut since value is defined as an object in every template specialization, but not in general). Whether a symbol represents an object or a type is relevant to the grammar of C++, so I would probably argue that "parsing" C++ does require template instantiation. Other definitions might vary, though.
Actual implementations arbitrarily cap the depth of template instantiation, making a big-O analysis irrelevant, but I ignore that since in any case actual implementations have natural resource limits, also making big-O analysis irrelevant...
I expect you can produce similarly-difficult programs in C with recursive #include, although I'm not sure whether you intend to include the preprocessor as part of the parsing step.
Aside from that, C, in common with plenty of other languages, can have O(not very much) parsing. You may need symbol lookup and so on, which as David says in his answer cannot in general have a strict O(1) worst case bound, so O(n) might be a bit optimistic. Even an assembler might look up symbols for labels. But for example dynamic languages don't necessarily even need symbol lookup for parsing, since that might be done at runtime. If you pick a language where all the parser needs to do is establish which symbols are keywords, and do some kind of bracket-matching, then the Shunting Yard algorithm is O(n), so there's hope.

Hard to tell if C++ can be "just parsed", as - contrary to most languages - it cannot be analysed syntactically without performing semantic analysis at the same time.

Related

Why do C fundamental types have identifiers with multiple keywords

Apart from obvious answer because, guys designed it that way, why does C/C++ have types, which consist of multiple identifiers, e.g.
long long (int)
short int
signed char
A do have some basic knowledge of parsing and have used flex/bison tools to make few parsers and I think, that this bring much more complexity to parsing type names. And looking on C++ grammar in standard, everything about types really is complicated.
I know, that C++ (also C, I believe) do not specify much about sizes of fundamental data types, thus making types int_8, uint_8, etc. would not work (Altough c++11 gave us fixed width integers).
So, why did developers of standard agreed on multi-word type identifiers, when they could make int, uint and similar.
Speaking in terms of C, why did the developers of the standard agree on multi-word identifiers? It's because that was what the language had at the time of standardisation.
The mandate for the original standard was not to create a new language but to codify existing practice. As per the C89 standard itself:
The Committee evaluated many proposals for additions, deletions, and changes to the base documents during its deliberations. A concerted effort was made to codify existing practice wherever unambiguous and consistent practice could be identified. However, where no consistent practice could be identified, the Committee worked to establish clear rules that were consistent with the overall flavor of the language.
And, from the C99 rationale document:
The original X3J11 charter clearly mandated codifying common existing practice, and the C89 Committee held fast to precedent wherever that was clear and unambiguous. The vast majority of the language defined by C89 was precisely the same as defined in Appendix A of the first edition of The C Programming Language by Brian Kernighan and Dennis Ritchie, and as was implemented in almost all C translators of the time.
Beyond that, each iteration of the standard has valued backward compatibility highly so that code doesn't break. From that same rationale document:
Existing code is important, existing implementations are not. A large body of C code exists of considerable commercial value. Every attempt has been made to ensure that the bulk of this code will be acceptable to any implementation conforming to the Standard. The C89 Committee did not want to force most programmers to modify their C programs just to have them accepted by a conforming translator.
So, while later versions of the standard gave us things like stdint.h with its fixed width integral types, taking away the standard ones like int and long would be a gross violation of that guideline.
In terms of C++, it's almost certainly a holdover from the earliest days of that language where it was put forward as "C plus classes". In fact, the very early cfront C++ compiler was so named because it took C++ source code and turned that into C before giving it to a suitable C compiler (i.e., a front end for C, hence cfront).
This would have allowed the original author Bjarne to minimise the workload in delivering C++ since the bulk of it was already provided by the C compiler itself.
In terms of parsing a language, it's certainly more difficult to have to process unsigned long int x (a) than it is to handle ulong x.
But, given that the compiler already has to handle a large number of optional "modifiers/specifiers" for a variable (e.g., const char * const x), handling a few others is par for the course.
(a) Or int long unsigned x or long unsigned x or any of the other type specifiers that end up becoming the singular unsigned long int type. See here for more details.
Adding new reserved words to a language will break any code which happens to use such words as identifiers unless those words are of a form which is reserved for future expansion (e.g. contain two leading underscores, or start with an underscore and a capital letter, etc.)
By contrast, if some particular sequence of reserved words has no defined meaning in any existing implementation, there can be no existing code which uses that sequence of reserved words, and thus no danger of breaking existing code by attaching a new meaning to it.

Writing a Parser for a programming language: Output [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I'm trying to write a simple interpreted programming language in C++. I've read that a lot of people use tools such Lex/Flex Bison to avoid "reinventing the wheel", but since my goal is to understand how these little beasts work improving my knowledge, i've decided to write the Lexer and the Parser from scratch. At the moment i'm working on the parser (the lexer is complete) and i was asking myself what should be its output. A tree? A linear vector of statements with a "depth" or "shift" parameter? How should i manage loops and if statements? Should i replace them with invisible goto statements?
A parser should almost always output an AST. An AST is simply, in the broadest sense, a tree representation of the syntactical structure of the program. A Function becomes an AST node containing the AST of the function body. An if becomes an AST node containing the AST of the condition and the body. A use of an operator becomes an AST node containing the AST of each operand. Integer literals, variable names, and so on become leaf AST nodes. Operator precedence and such is implicit in the relationship of the nodes: Both 1 * 2 + 3 and (1 * 2) + 3 are represented as Add(Mul(Int(1), Int(2)), Int(3)).
Many details of what's in the AST depend on your language (obviously) and what you want to do with the tree. If you want to analyze and transform the program (i.e. split out altered source code at the end), you might preserve comments. If you want detailed error messages, you might add source locations (as in, this integer literal was on line 5 column 12).
A compiler will proceed to turn the AST into a different format (e.g. a linear IR with gotos, or data flow graphs). Going through the AST is still a good idea, because a well-designed AST has a good balance of being syntax-oriented but only storing what's important for understanding the program. The parser can focus on parsing while the later transformations are protected from irrelevant details such as the amount of white space and operator precedence. Note that such a "compiler" might also output bytecode that's later interpreted (the reference implementation of Python does this).
A relatively pure interpreter might instead interpret the AST. Much has been written about this; it is about the easiest way to execute the parser's output. This strategy benefits from the AST in much the same way as a compiler; in particular most interpretation is simply top-down traversal of the AST.
The formal and most properly correct answer is going to be that you should return an Abstract Syntax Tree. But that is simultaneously the tip of an iceberg and no answer at all.
An AST is simply a structure of nodes describing the parse; a visualization of the paths your parse took thru the token/state machine.
Each node represents a path or description. For example, you would have nodes which represents language statements, nodes which represent compiler directives and nodes which represent data.
Consider a node which describes a variable, and lets say your language supports variables of int and string and the notion of "const". You may well choose to make the type a direct property of the Variable node struct/class, but typically in an AST you make properties - like constness - a "mutator", which is itself some form of node linked to the Variable node.
You could implement the C++ concept of "scope" by having locally-scoped variables as mutations of a BlockStatement node; the constraints of a "Loop" node (for, do, while, etc) as mutators.
When you closely tie your parser/tokenizer to your language implementation, it can become a nightmare making even small changes.
While this is true, if you actually want to understand how these things work, it is worth going through at least one first implementation where you begin to implement your runtime system (vm, interpreter, etc) and have your parser target it directly. (The alternative is, e.g., to buy a copy of the "Dragon Book" and read how it's supposed to be done, but it sounds like you are actually wanting to have the full understanding that comes from having worked thru the problem yourself).
The trouble with being told to return an AST is that an AST actually needs a form of parsing.
struct Node
{
enum class Type {
Variable,
Condition,
Statement,
Mutator,
};
Node* m_parent;
Node* m_next;
Node* m_child;
Type m_type;
string m_file;
size_t m_lineNo;
};
struct VariableMutatorNode : public Node
{
enum class Mutation {
Const
};
Mutation m_mutation;
// ...
};
struct VariableNode
{
VariableMutatorNode* m_mutators;
// ...
};
Node* ast; // Top level node in the AST.
This sort of AST is probably OK for a compiler that is independent of its runtime, but you'd need to tighten it up a lot for a complex, performance sensitive language down the (at which point there is less 'A' in 'AST').
The way you walk this tree is to start with the first node of 'ast' and act acording to it. If you're writing in C++, you can do this by attaching behaviors to each node type. But again, that's not so "abstract", is it?
Alternatively, you have to write something which works its way thru the tree.
switch (node->m_type) {
case Node::Type::Variable:
declareVariable(node);
break;
case Node::Type::Condition:
evaluate(node);
break;
case Node::Type::Statement:
execute(node);
break;
}
And as you write this, you'll find yourself thinking "wait, why didn't the parser do this for me?" because processing an AST often feels a lot like you did a crap job of implementing the AST :)
There are times when you can skip the AST and go straight to some form of final representation, and (rare) times when that is desirable; then there are times when you could go straight to some form of final representation but now you have to change the language and that decision will cost you a lot of reimplementation and headaches.
This is also generally the meat of building your compiler - the lexer and parser are generally the lesser parts of such an under taking. Working with the abstract/post-parse representation is a much more significant part of the work.
That's why people often go straight to flex/bison or antlr or some such.
And if that's what you want to do, looking at .NET or LLVM/Clang can be a good option, but you can also fairly easily bootstrap yourself with something like this: http://gnuu.org/2009/09/18/writing-your-own-toy-compiler/4/
Best of luck :)
I would build a tree of statements. After that, yes the goto statements are how the majority of it works (jumps and calls). Are you translating to a low level like assembly?
The output of the parser should be an abstract syntax tree, unless you know enough about writing compilers to directly produce byte-code, if that's your target language. It can be done in one pass but you need to know what you're doing. The AST expresses loops and ifs directly: you're not concerned with translating them yet. That comes under code generation.
People don't use lex/yacc to avoid re-inventing the wheel, the use it to build a more robust compiler prototype more quickly, with less effort, and to focus on the language, and avoid getting bogged down in other details. From personal experience with several VM projects, compilers and assemblers, I suggest if you want to learn how to build a language, do just that -- focus on building a language (first).
Don't get distracted with:
Writing your own VM or runtime
Writing your own parser generator
Writing your own intermediate language or assembler
You can do these later.
This is a common thing I see when a bright young computer scientist first catches the "language fever" (and its good thing to catch), but you need to be careful and focus your energy on the one thing you want to do well, and make use of other robust, mature technologies like parser generators, lexers, and runtime platforms. You can always circle back later, when you have slain the compiler dragon first.
Just spend your energy learning how a LALR grammar works, write your language grammar in Bison or Yacc++ if you can still find it, don't get distracted by people who say you should be using ANTLR or whatever else, that isn't the goal early on. Early on, you need to focus on crafting your language, removing ambiguities, creating a proper AST (maybe the most important skillset), semantic checking, symbol resolution, type resolution, type inference, implicit casting, tree rewriting, and of course, end program generation. There is enough to be done making a proper language that you don't need to be learning multiple other areas of research that some people spend their whole careers mastering.
I recommend you target an existing runtime like the CLR (.NET). It is one of the best runtimes for crafting a hobby language. Get your project off the ground using a textual output to IL, and assemble with ilasm. ilasm is relatively easy to debug, assuming you put some time into learning it. Once you get a prototype going, you can then start thinking about other things like an alternate output to your own interpreter, in case you have language features that are too dynamic for the CLR (then look at the DLR). The main point here is that CLR provides a good intermediate representation to output to. Don't listen to anyone that tells you you should be directly outputting bytecode. Text is king for learning in the early stages and allows you to plug and play with different languages / tools. A good book is by the author John Gough, titled Compiling for the .NET Common Language Runtime (CLR) and he takes you through the implementation of the Gardens Point Pascal Compiler, but it isn't a book about Pascal, it is a book about how to build a real compiler on the CLR. It will answer many of your questions on implementing loops and other high level constructs.
Related to this, a great tool for learning is to use Visual Studio and ildasm (the disassembler) and .NET Reflector. All available for free. You can write small code samples, compile them, then disassemble them to see how they map to a stack based IL.
If you aren't interested in the CLR for whatever reason, there are other options out there. You will probably run across llvm, Mono, NekoVM, and Parrot (all good things to learn) in your searches. I was an original Parrot VM / Perl 6 developer, and wrote the Perl Intermediate Representation language and imcc compiler (which is quite a terrible piece of code I might add) and the first prototype Perl 6 compiler. I suggest you stay away from Parrot and stick with something easier like .NET CLR, you'll get much further. If, however, you want to build a real dynamic language, and want to use Parrot for its continuations and other dynamic features, see the O'Reilly Books Perl and Parrot Essentials (there are several editions), the chapters on PIR/IMCC are about my stuff, and are useful. If your language isn't dynamic, then stay far away from Parrot.
If you are bent on writing your own VM, let me suggest you prototype the VM in Perl, Python or Ruby. I have done this a couple of times with success. It allows you to avoid too much implementation early, until your language starts to mature. Perl+Regex are easy to tweak. An intermediate language assembler in Perl or Python takes a few days to write. Later, you can rewrite the 2nd version in C++ if you still feel like it.
All this I can sum up with: avoid premature optimizations, and avoid trying to do everything at once.
First you need to get a good book. So I refer you to the book by John Gough in my other answer, but emphasize, focus on learning to implement an AST for a single, existing platform first. It will help you learn about AST implementation.
How to implement a loop?
Your language parser should return a tree node during the reduce step for the WHILE statement. You might name your AST class WhileStatement, and the WhileStatement has, as members, ConditionExpression and BlockStatement and several labels (also inheritable but I added inline for clarity).
Grammar pseudocode below, shows the how the reduce creates a new object of WhileStatement from a typical shift-reduce parser reduction.
How does a shift-reduce parser work?
WHILE ( ConditionExpression )
BlockStatement
{
$$ = new WhileStatement($3, $5);
statementList.Add($$); // this is your statement list (AST nodes), not the parse stack
}
;
As your parser sees "WHILE", it shifts the token on the stack. And so forth.
parseStack.push(WHILE);
parseStack.push('(');
parseStack.push(ConditionalExpression);
parseStack.push(')');
parseStack.push(BlockStatement);
The instance of WhileStatement is a node in a linear statement list. So behind the scenes, the "$$ =" represents a parse reduce (though if you want to be pedantic, $$ = ... is user-code, and the parser is doing its own reductions implicitly, regardless). The reduce can be thought of as popping off the tokens on the right side of the production, and replacing with the single token on the left side, reducing the stack:
// shift-reduce
parseStack.pop_n(5); // pop off the top 5 tokens ($1 = WHILE, $2 = (, $3 = ConditionExpression, etc.)
parseStack.push(currToken); // replace with the current $$ token
You still need to add your own code to add statements to a linked list, with something like "statements.add(whileStatement)" so you can traverse this later. The parser has no such data structure, and its stacks are only transient.
During parse, synthesize a WhileStatement instance with its appropriate members.
In latter phase, implement the visitor pattern to visit each statement and resolve symbols and generate code. So a while loop might be implemented with the following AST C++ class:
class WhileStatement : public CompoundStatement {
public:
ConditionExpression * condExpression; // this is the conditional check
Label * startLabel; // Label can simply be a Symbol
Label * redoLabel; // Label can simply be a Symbol
Label * endLabel; // Label can simply be a Symbol
BlockStatement * loopStatement; // this is the loop code
bool ResolveSymbolsAndTypes();
bool SemanticCheck();
bool Emit(); // emit code
}
Your code generator needs to have a function that generates sequential labels for your assembler. A simple implementation is a function to return a string with a static int that increments, and returns LBL1, LBL2, LBL3, etc. Your labels can be symbols, or you might get fancy with a Label class, and use a constructor for new Labels:
class Label : public Symbol {
public Label() {
this.name = newLabel(); // incrementing LBL1, LBL2, LBL3
}
}
A loop is implemented by generating the code for condExpression, then the redoLabel, then the blockStatement, and at the end of blockStatement, then goto to redoLabel.
A sample from one of my compilers to generate code for the CLR.
// Generate code for .NET CLR for While statement
//
void WhileStatement::clr_emit(AST *ctx)
{
redoLabel = compiler->mkLabelSym();
startLabel = compiler->mkLabelSym();
endLabel = compiler->mkLabelSym();
// Emit the redo label which is the beginning of each loop
compiler->out("%s:\n", redoLabel->getName());
if(condExpr) {
condExpr->clr_emit_handle();
condExpr->clr_emit_fetch(this, t_bool);
// Test the condition, if false, branch to endLabel, else fall through
compiler->out("brfalse %s\n", endLabel->getName());
}
// The body of the loop
compiler->out("%s:\n", startLabel->getName()); // start label only for clarity
loopStmt->clr_emit(this); // generate code for the block
// End label, jump out of loop
compiler->out("br %s\n", redoLabel->getName()); // goto redoLabel
compiler->out("%s:\n", endLabel->getName()); // endLabel for goto out of loop
}

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather being able to cope with a source fragment.
The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help in order to clarify and rank requirements. This would work for C or C++ ...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc to compile so that each function is compiled to an 'independent chunk' within the generated object code file. The linker can pull each 'chunk' into a program. (It is getting pretty late here, so I will look this up in the morning, if you are interested in the idea. )
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positive (changes).
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from different a change set) are the same instructions :-)
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining new lines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}', balanced '(' ')' and all the other valid program text is just individual tokens which can be passed 'straight through'.
It's output might be a separate file/function to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then a parser is mostly just looking for a the function name (an identifier) at file scope followed by ( parameter-list ) { ... code ... }
I'd SWAG it would be a few days work using yacc & lex / flex & bison, and it might be so simple that their is no need for the parser generator.
If the code is Java, then ANTLR is a possible, and I think there was a simple Java parser example.
If Haskell is your focus, their may be student projects published which have made a reasonable stab at a parser.

Where does the k prefix for constants come from?

it's a pretty common practice that constants are prefixed with k (e.g. k_pi). But what does the k mean?
Is it simply that c already meant char?
It's a historical oddity, still common practice among teams who like to blindly apply coding standards that they don't understand.
Long ago, most commercial programming languages were weakly typed; automatic type checking, which we take for granted now, was still mostly an academic topic. This meant that is was easy to write code with category errors; it would compile and run, but go wrong in ways that were hard to diagnose. To reduce these errors, a chap called Simonyi suggested that you begin each variable name with a tag to indicate its (conceptual) type, making it easier to spot when they were misused. Since he was Hungarian, the practise became known as "Hungarian notation".
Some time later, as typed languages (particularly C) became more popular, some idiots heard that this was a good idea, but didn't understand its purpose. They proposed adding redundant tags to each variable, to indicate its declared type. The only use for them is to make it easier to check the type of a variable; unless someone has changed the type and forgotten to update the tag, in which case they are actively harmful.
The second (useless) form was easier to describe and enforce, so it was blindly adopted by many, many teams; decades later, you still see it used, and even advocated, from time to time.
"c" was the tag for type "char", so it couldn't also be used for "const"; so "k" was chosen, since that's the first letter of "konstant" in German, and is widely used for constants in mathematics.
I haven't seen it that much, but maybe it comes from certain languages' (the germanic ones in particular) spelling of the word constant - konstant.
Don't use Hungarian Notation. If you want constants to stand out, make them all caps.
As a side note: there are a lot of things in the Google Coding Standards that are poor practice (in terms of code readability). That is what happens when you design a coding standard by committee.
It means the value is k-onstant.
I think mathematical convention was the precedent. k is used in maths all the time as just some constant.
K stands for konstant, a wordplay on constant. It relates to Coding Styles.
It's just a matter of preference, some people and projects use them which means they also embrace the Hungarian notation, many don't. That's not that important.
If you're unsure what a prefix or style might mean, always check if the project has a coding style reference and read that.
Actually, whenever I define constants in typescript, I do something like this -
NODE_ENV = 'production';
But recently, I saw that the k prefix is being used in the Flutter SDK. It makes sense to me to keep using the k prefix cuz' it helps your editor/IDE in searching out constants in your codebase.
It's a convention, probably from math. But there are other suggestions for constant too, for example Kernighan and Ritchie in their book "The C language" suggest writing constants' name in capital letters (e.g. #define MAX 55).
I think, it means coefficient (as k in math means)

Is there a 'catch' with FastFormat?

I just read about the FastFormat C++ i/o formatting library, and it seems too good to be true: Faster even than printf, typesafe, and with what I consider a pleasing interface:
// prints: "This formats the remaining arguments based on their order - in this case we put 1 before zero, followed by 1 again"
fastformat::fmt(std::cout, "This formats the remaining arguments based on their order - in this case we put {1} before {0}, followed by {1} again", "zero", 1);
// prints: "This writes each argument in the order, so first zero followed by 1"
fastformat::write(std::cout, "This writes each argument in the order, so first ", "zero", " followed by ", 1);
This looks almost too good to be true. Is there a catch? Have you had good, bad or indifferent experiences with it?
Is there a 'catch' with FastFormat?
Last time I checked, there was one annoying catch:
You can only use either the narrow string version or the wide string version of this library. (The functions for wchar_t and char are the same -- which type is used is a compile time switch.)
With iostreams, stdio or Boost.Format you can use both.
Found one "catch", though for most people it will never manifest. From the project page:
Atomic operation. It doesn't write out statement elements one at a time, like the IOStreams, so has no atomicity issues
The only way I can see this happening is if it buffers the whole write() call's output itself, then writes it out to the ostream in one step. This means it needs to allocate memory, and if an object passed into the write() call produces a lot of output (several megabytes or more), it can consume up to twice that much memory in internal buffers (assuming it uses the grow-a-buffer-by-doubling-its-size-each-time trick).
If you're just using it for logging, and not, say, dumping huge amounts of XML, you'll never see this problem.
The only other "catch" I'm seeing is:
Highly portable. It will work with all good modern C++ compilers; it even works with Visual C++ 6!
So it won't work with an old C++ compiler, like cfront, whereas iostreams is backward compatible to the late 80's. Again, I'd be surprised if anyone ever had a problem with this.
Although FastFormat is a good library there are a number of issues with it:
Limited formatting support, in particular the following features are not supported:
Leading zeros (or any other non-space padding)
Octal/hexadecimal encoding
Runtime width/alignment specification
The library is quite big for a relatively small task of formatting and has even bigger dependency (STLSoft).
It looks pretty interesting indeed! Good tip regardless, and +1 for that!
I've been playing with it for a bit. The main drawback I see is that FastFormat supports less formatting options for the output. This is I think a direct consequence of the way the higher typesafety is achieved, and a good tradeoff depending on your circumstances.
If you look in detail at his performance benchmark page, you'll notice that good old C printf-family functions are still winning on Linux. In fact, the only test case where they perform poorly is the test case that should be static string concatenations, where I would expect printf to be wasteful. Moreover, GCC provides static type-checking on printf-style function calls, so the benefit of type-safety is reduced. So: if you are running on Linux and if you need the absolute best performance, FastFormat is probably not the optimal solution.
The library depends on a couple of environment variables, as mentioned in the docs.
That might be no biggie to some people, but I'd prefer my code to be as self-contained as possible. If I check it out from source control, it should work and compile. It won't, if it requires you to set environment variables.