Embedding another language in Flex/Bison - C++

The bottom line:
If you wanted to add one very small feature to C++ using Flex/Bison, how would you do it? For example, the ability to declare void xxx() functions with the syntax foo%%: xxx?
The whole story:
I once coded a custom shader-processing program that built ready-to-use Cg/GLSL pixel and vertex shaders from a number of blocks. I added a few features, mostly related to static compilation (something like a "better preprocessor").
For example, something that would look like
#if LIGHT_TYPE == POINT
float lightStrength = dot(N, normalize(pos - lightPos));
#elif LIGHT_TYPE == DIRECTIONAL
float lightStrength = dot(N, lightDir);
#endif
with pure macros, looks like
[LightType = Point]
[Require: pos, lightPos]
float LightStrength()
{
!let %N Normal
return dot(%N, normalize(pos - lightPos));
}
in my "language". As you can see, "functions" can be provided for different light/material types. It is also possible to "call" another function and to mark which uniform/varying attributes are required for a specific shader.
Why all this effort? Because (especially on early SM 2.0 cards) there were numerous limits on attributes, and my "compiler" produced a ready-to-use shader with a list of required attributes/variables, managed parameters between pixel and vertex shaders, and optimized some static stuff (just for readability, as the Cg compiler would optimize it later anyway).
OK, but I am not writing all of this to praise myself or something.
I wrote this "compiler" in C#, and it was initially a very small program. But as time passed, many features were added, and now the program is a complete mess with no realistic option of refactoring. Also, being written in C# means I cannot embed the compiler directly in a C++ game, forcing me to spawn processes to compile shaders (which takes a lot of time).
I would like to rewrite my language/compiler in C++ using the Flex/Bison toolkit. I've already written an efficient math parser in Flex/Bison, so I am somewhat experienced in this matter. However, there is one thing that I cannot resolve myself, and it is the very topic of my question.
How can I possibly embed GLSL in my language? In the C# compiler, I just went line by line, checked whether a line started with special characters (like % or [), and then did many tricks and hacks using string replacement and/or regexes.
Now I would like to write a clean Bison grammar and do it properly. But including the whole syntax of GLSL scares me. It is quite a big and complicated language, and it is constantly evolving. And most of the GLSL code would just be passed through anyway.
If you are not experienced with GLSL/Cg/graphics programming at all, that is not really important; the question reduces to the "bottom line" above.
So, if you wanted to add one very small feature to C++ using Flex/Bison, how would you do it? For example, the ability to declare void xxx() functions with the syntax foo%%: xxx?

I added multithreading mechanisms to Pascal for a school project once, so I have had to deal with this. What I did was find a BNF definition of Pascal and copy that, mainly echoing the token text unchanged in 99% of the cases, and emitting the new intermediate-language code for the 1% of new tokens I added. Pascal is simple, so the BNF definition is relatively simple. C++, not so much.
The C++ Programming Language by Stroustrup has the language grammar in a form that is practically parser-ready, but that's a lot of code to copy by hand. A web search may help you find a bison/lex set for "standard" C++, and then you can modify that.
EDIT:
I found my yacc files, so here's a small example from my Pascal code:
procheader: PROCEDURE ID paramslist ';'
              { thread::inLocalDecl = true; $$ = "procedure " + $2 + $3 + ";\n"; }
          | FUNCTION ID paramslist ':' ID ';'
              { thread::inLocalDecl = true; $$ = "function " + $2 + $3 + ":" + $5 + ";\n"; }
          | THREAD { thread::inThreadDecl = true;
                     thread::inArgDecl = true; }
            ID { thread::curName = $3; }
            paramslist { thread::inArgDecl = false; } ';'
              {
                /* a lot of code to transform this custom construct into standard pascal */
              }
          ;
The first two alternatives were just standard Pascal: I copied the input tokens into the output strings verbatim (i.e. I didn't actually modify anything from the input file).
The third alternative (my THREAD keyword) was, obviously, not standard Pascal, so I transformed the output into something that I could actually compile as standard Pascal.
Basically, to compile my "threaded" Pascal, I had to take my source file, pass it through my parser, and then compile the output.
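To make the pass-through idea concrete for your foo%%: xxx example, here is a minimal Flex sketch (the translation to void xxx() is my reading of the question's intent; everything the lexer doesn't recognise as the new construct is simply echoed verbatim):
%{
#include <cstdio>
#include <cstring>
%}
%option noyywrap
%%
"foo%%:"[ \t]*[A-Za-z_][A-Za-z_0-9]*   {
                                           /* the one new construct: rewrite "foo%%: xxx" as "void xxx()" */
                                           const char *name = strchr(yytext, ':') + 1;
                                           while (*name == ' ' || *name == '\t') ++name;
                                           printf("void %s()", name);
                                       }
.|\n                                   { fputs(yytext, stdout); /* everything else passes through untouched */ }
%%
int main() { return yylex(); }
The same structure scales up: the more of the host language you want to understand, the more rules you add; the rest keeps falling through to the echo rule.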

Related

Writing a Parser for a programming language: Output [closed]

I'm trying to write a simple interpreted programming language in C++. I've read that a lot of people use tools such as Lex/Flex and Bison to avoid "reinventing the wheel", but since my goal is to understand how these little beasts work and improve my knowledge, I've decided to write the lexer and the parser from scratch. At the moment I'm working on the parser (the lexer is complete) and I was asking myself what its output should be. A tree? A linear vector of statements with a "depth" or "shift" parameter? How should I manage loops and if statements? Should I replace them with invisible goto statements?
A parser should almost always output an AST. An AST is simply, in the broadest sense, a tree representation of the syntactic structure of the program. A function becomes an AST node containing the AST of the function body. An if becomes an AST node containing the ASTs of the condition and the body. A use of an operator becomes an AST node containing the AST of each operand. Integer literals, variable names, and so on become leaf AST nodes. Operator precedence and such is implicit in the relationship of the nodes: both 1 * 2 + 3 and (1 * 2) + 3 are represented as Add(Mul(Int(1), Int(2)), Int(3)).
Many details of what's in the AST depend on your language (obviously) and on what you want to do with the tree. If you want to analyze and transform the program (i.e. emit altered source code at the end), you might preserve comments. If you want detailed error messages, you might add source locations (as in: this integer literal was on line 5, column 12).
A compiler will proceed to turn the AST into a different format (e.g. a linear IR with gotos, or data flow graphs). Going through the AST is still a good idea, because a well-designed AST has a good balance of being syntax-oriented but only storing what's important for understanding the program. The parser can focus on parsing while the later transformations are protected from irrelevant details such as the amount of white space and operator precedence. Note that such a "compiler" might also output bytecode that's later interpreted (the reference implementation of Python does this).
A relatively pure interpreter might instead interpret the AST. Much has been written about this; it is about the easiest way to execute the parser's output. This strategy benefits from the AST in much the same way as a compiler; in particular most interpretation is simply top-down traversal of the AST.
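To make the Add(Mul(Int(1), Int(2)), Int(3)) picture concrete, here is one minimal way such nodes could look in C++ (the class names mirror the example above; this is an illustration, not a prescription):
#include <memory>

// A tiny expression AST: leaves are integer literals, interior nodes are operators.
struct Expr {
    virtual ~Expr() = default;
    virtual int eval() const = 0;   // a pure AST interpreter is just a traversal
};

struct Int : Expr {
    int value;
    explicit Int(int v) : value(v) {}
    int eval() const override { return value; }
};

struct Add : Expr {
    std::unique_ptr<Expr> lhs, rhs;
    Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) : lhs(std::move(l)), rhs(std::move(r)) {}
    int eval() const override { return lhs->eval() + rhs->eval(); }
};

struct Mul : Expr {
    std::unique_ptr<Expr> lhs, rhs;
    Mul(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) : lhs(std::move(l)), rhs(std::move(r)) {}
    int eval() const override { return lhs->eval() * rhs->eval(); }
};

// 1 * 2 + 3 parses to Add(Mul(Int(1), Int(2)), Int(3)); precedence lives in the nesting,
// and evaluating the program is just ast->eval().
A parser would build such a tree bottom-up in its reduce actions, much like the shift-reduce examples further down this page.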
The formal and most technically correct answer is going to be that you should return an Abstract Syntax Tree. But that is simultaneously the tip of an iceberg and no answer at all.
An AST is simply a structure of nodes describing the parse; a visualization of the paths your parse took through the token/state machine.
Each node represents a path or description. For example, you would have nodes which represent language statements, nodes which represent compiler directives, and nodes which represent data.
Consider a node which describes a variable, and let's say your language supports int and string variables and the notion of "const". You may well choose to make the type a direct property of the Variable node struct/class, but typically in an AST you make properties - like constness - a "mutator", which is itself some form of node linked to the Variable node.
You could implement the C++ concept of "scope" by having locally scoped variables as mutations of a BlockStatement node, and the constraints of a "Loop" node (for, do, while, etc.) as mutators.
When you closely tie your parser/tokenizer to your language implementation, it can become a nightmare making even small changes.
While this is true, if you actually want to understand how these things work, it is worth going through at least one first implementation where you begin to implement your runtime system (VM, interpreter, etc.) and have your parser target it directly. (The alternative is, e.g., to buy a copy of the "Dragon Book" and read how it's supposed to be done, but it sounds like you actually want the full understanding that comes from having worked through the problem yourself.)
The trouble with being told to return an AST is that an AST actually needs a form of parsing.
#include <cstddef>
#include <string>

struct Node
{
    enum class Type {
        Variable,
        Condition,
        Statement,
        Mutator,
    };
    Node* m_parent;
    Node* m_next;
    Node* m_child;
    Type m_type;
    std::string m_file;
    std::size_t m_lineNo;
};

struct VariableMutatorNode : public Node
{
    enum class Mutation {
        Const,
    };
    Mutation m_mutation;
    // ...
};

struct VariableNode : public Node
{
    VariableMutatorNode* m_mutators;
    // ...
};

Node* ast; // Top-level node in the AST.
This sort of AST is probably OK for a compiler that is independent of its runtime, but you'd need to tighten it up a lot for a complex, performance-sensitive language down the road (at which point there is less 'A' in 'AST').
The way you walk this tree is to start with the first node of 'ast' and act according to it. If you're writing in C++, you can do this by attaching behaviors to each node type. But again, that's not so "abstract", is it?
Alternatively, you have to write something which works its way through the tree:
switch (node->m_type) {
case Node::Type::Variable:
    declareVariable(node);
    break;
case Node::Type::Condition:
    evaluate(node);
    break;
case Node::Type::Statement:
    execute(node);
    break;
}
And as you write this, you'll find yourself thinking "wait, why didn't the parser do this for me?" because processing an AST often feels a lot like you did a crap job of implementing the AST :)
There are times when you can skip the AST and go straight to some form of final representation, and (rare) times when that is desirable; and there are times when you could have gone straight to some form of final representation, but now you have to change the language, and that decision will cost you a lot of reimplementation and headaches.
This is also generally the meat of building your compiler - the lexer and parser are generally the lesser parts of such an undertaking. Working with the abstract/post-parse representation is a much more significant part of the work.
That's why people often go straight to flex/bison or antlr or some such.
And if that's what you want to do, looking at .NET or LLVM/Clang can be a good option, but you can also fairly easily bootstrap yourself with something like this: http://gnuu.org/2009/09/18/writing-your-own-toy-compiler/4/
Best of luck :)
I would build a tree of statements. After that, yes, goto statements are how the majority of it works (jumps and calls). Are you translating to something low-level like assembly?
The output of the parser should be an abstract syntax tree, unless you know enough about writing compilers to directly produce byte-code, if that's your target language. It can be done in one pass but you need to know what you're doing. The AST expresses loops and ifs directly: you're not concerned with translating them yet. That comes under code generation.
People don't use lex/yacc to avoid re-inventing the wheel; they use them to build a more robust compiler prototype more quickly, with less effort, to focus on the language, and to avoid getting bogged down in other details. From personal experience with several VM projects, compilers, and assemblers, I suggest that if you want to learn how to build a language, do just that -- focus on building a language (first).
Don't get distracted with:
Writing your own VM or runtime
Writing your own parser generator
Writing your own intermediate language or assembler
You can do these later.
This is a common thing I see when a bright young computer scientist first catches the "language fever" (and it's a good thing to catch), but you need to be careful and focus your energy on the one thing you want to do well, making use of other robust, mature technologies like parser generators, lexers, and runtime platforms. You can always circle back later, once you have slain the compiler dragon.
Just spend your energy learning how an LALR grammar works, and write your language grammar in Bison (or Yacc++, if you can still find it); don't get distracted by people who say you should be using ANTLR or whatever else, because that isn't the goal early on. Early on, you need to focus on crafting your language: removing ambiguities, creating a proper AST (maybe the most important skill set), semantic checking, symbol resolution, type resolution, type inference, implicit casting, tree rewriting, and of course, end-program generation. There is enough to be done making a proper language that you don't need to be learning multiple other areas of research that some people spend their whole careers mastering.
I recommend you target an existing runtime like the CLR (.NET). It is one of the best runtimes for crafting a hobby language. Get your project off the ground using a textual output to IL, and assemble with ilasm. ilasm is relatively easy to debug, assuming you put some time into learning it. Once you get a prototype going, you can then start thinking about other things like an alternate output to your own interpreter, in case you have language features that are too dynamic for the CLR (then look at the DLR). The main point here is that CLR provides a good intermediate representation to output to. Don't listen to anyone that tells you you should be directly outputting bytecode. Text is king for learning in the early stages and allows you to plug and play with different languages / tools. A good book is by the author John Gough, titled Compiling for the .NET Common Language Runtime (CLR) and he takes you through the implementation of the Gardens Point Pascal Compiler, but it isn't a book about Pascal, it is a book about how to build a real compiler on the CLR. It will answer many of your questions on implementing loops and other high level constructs.
Related to this, a great tool for learning is to use Visual Studio and ildasm (the disassembler) and .NET Reflector. All available for free. You can write small code samples, compile them, then disassemble them to see how they map to a stack based IL.
If you aren't interested in the CLR for whatever reason, there are other options out there. You will probably run across llvm, Mono, NekoVM, and Parrot (all good things to learn) in your searches. I was an original Parrot VM / Perl 6 developer, and wrote the Perl Intermediate Representation language and imcc compiler (which is quite a terrible piece of code I might add) and the first prototype Perl 6 compiler. I suggest you stay away from Parrot and stick with something easier like .NET CLR, you'll get much further. If, however, you want to build a real dynamic language, and want to use Parrot for its continuations and other dynamic features, see the O'Reilly Books Perl and Parrot Essentials (there are several editions), the chapters on PIR/IMCC are about my stuff, and are useful. If your language isn't dynamic, then stay far away from Parrot.
If you are bent on writing your own VM, let me suggest you prototype the VM in Perl, Python or Ruby. I have done this a couple of times with success. It allows you to avoid too much implementation early, until your language starts to mature. Perl+Regex are easy to tweak. An intermediate language assembler in Perl or Python takes a few days to write. Later, you can rewrite the 2nd version in C++ if you still feel like it.
All this I can sum up with: avoid premature optimizations, and avoid trying to do everything at once.
First you need to get a good book. So I refer you to the book by John Gough in my other answer, but I emphasize: focus on learning to implement an AST for a single, existing platform first. It will help you learn about AST implementation.
How to implement a loop?
Your language parser should return a tree node during the reduce step for the WHILE statement. You might name your AST class WhileStatement, and the WhileStatement has, as members, a ConditionExpression, a BlockStatement, and several labels (these could also be inherited, but I added them inline for clarity).
The grammar pseudocode below shows how the reduce step creates a new WhileStatement object in a typical shift-reduce parser reduction.
How does a shift-reduce parser work?
WhileStatement
    : WHILE '(' ConditionExpression ')' BlockStatement
        {
            $$ = new WhileStatement($3, $5);
            statementList.Add($$); // this is your statement list (AST nodes), not the parse stack
        }
    ;
As your parser sees "WHILE", it shifts the token onto the stack, and so forth:
parseStack.push(WHILE);
parseStack.push('(');
parseStack.push(ConditionExpression);
parseStack.push(')');
parseStack.push(BlockStatement);
The instance of WhileStatement is a node in a linear statement list. So behind the scenes, the "$$ =" represents a parse reduce (though if you want to be pedantic, $$ = ... is user code, and the parser is doing its own reductions implicitly, regardless). The reduce can be thought of as popping the tokens on the right side of the production off the stack and replacing them with the single symbol on the left side, reducing the stack:
// shift-reduce
parseStack.pop_n(5); // pop off the top 5 tokens ($1 = WHILE, $2 = (, $3 = ConditionExpression, etc.)
parseStack.push(currToken); // replace with the current $$ token
You still need to add your own code to add statements to a linked list, with something like "statements.add(whileStatement)" so you can traverse this later. The parser has no such data structure, and its stacks are only transient.
During parse, synthesize a WhileStatement instance with its appropriate members.
In a later phase, implement the visitor pattern to visit each statement, resolve symbols, and generate code. A while loop might be represented with the following AST C++ class:
class WhileStatement : public CompoundStatement {
public:
    ConditionExpression * condExpression; // this is the conditional check
    Label * startLabel;                   // Label can simply be a Symbol
    Label * redoLabel;                    // Label can simply be a Symbol
    Label * endLabel;                     // Label can simply be a Symbol
    BlockStatement * loopStatement;       // this is the loop code

    bool ResolveSymbolsAndTypes();
    bool SemanticCheck();
    bool Emit();                          // emit code
};
Your code generator needs a function that generates sequential labels for your assembler. A simple implementation is a function with a static int that increments and returns LBL1, LBL2, LBL3, etc. Your labels can be symbols, or you might get fancy with a Label class and use a constructor for new Labels:
class Label : public Symbol {
public:
    Label() {
        name = newLabel(); // incrementing: LBL1, LBL2, LBL3, ...
    }
};
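A minimal version of that newLabel() helper might look like this (purely illustrative; it just wraps the static counter described above):
#include <string>

// Returns "LBL1", "LBL2", "LBL3", ... on successive calls.
std::string newLabel() {
    static int counter = 0;
    return "LBL" + std::to_string(++counter);
}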
A loop is implemented by emitting the redoLabel, then the code for condExpression, then the blockStatement, and, at the end of the blockStatement, a branch back to redoLabel.
Here is a sample from one of my compilers that generates code for the CLR:
// Generate code for the .NET CLR for a While statement
//
void WhileStatement::clr_emit(AST *ctx)
{
    redoLabel  = compiler->mkLabelSym();
    startLabel = compiler->mkLabelSym();
    endLabel   = compiler->mkLabelSym();

    // Emit the redo label, which is the beginning of each loop iteration
    compiler->out("%s:\n", redoLabel->getName());

    if(condExpr) {
        condExpr->clr_emit_handle();
        condExpr->clr_emit_fetch(this, t_bool);
        // Test the condition: if false, branch to endLabel, else fall through
        compiler->out("brfalse %s\n", endLabel->getName());
    }

    // The body of the loop
    compiler->out("%s:\n", startLabel->getName());  // start label only for clarity
    loopStmt->clr_emit(this);                       // generate code for the block

    // Jump back to the top of the loop, then emit the label for jumping out of it
    compiler->out("br %s\n", redoLabel->getName()); // goto redoLabel
    compiler->out("%s:\n", endLabel->getName());    // endLabel for goto out of loop
}

How to parse DSL input to high performance expression template

(I have edited both the title and the main text, and created a spin-off question that arose.)
For our application it would be ideal to parse a simple DSL of logical expressions. The way I'd like to do this is to parse (at runtime) the input text containing the expressions into some lazily evaluated structure (an expression template) which can then be used later within more performance-sensitive code.
Ideally evaluation should be as fast as possible with this technique, as it will be run a large number of times, with different values substituted into the placeholders each time. I'm not expecting the expression template to match the performance of, say, a hardcoded function that models the same expression as the given input string; i.e. there is no need to go down the route of actually compiling, say, C++ inside a running program (I believe other questions cover dynamic library compiling/loading).
My own thought after reading examples from Boost is that I can use boost::spirit to parse the input text, and I'm confident I can develop the grammar I need. However, I'm not sure how to combine the parser with boost::proto to build an executable expression template. Most examples of Spirit that I've seen are merely interpreters, or end up building some kind of syntax tree but go no further. Most examples of Proto that I've seen assume the DSL is embedded in the host source code and does not need to be interpreted from a string first. I'm aware that boost::spirit is actually implemented with boost::proto, but I'm not sure whether that is relevant to the problem or whether that fact suggests a convenient solution.
To reiterate, I need to be able to make something like the following real:
const std::string input_text("a && b || c");
// const std::string input_text(get_dsl_string_from_file("expression1.dsl"));
Expression expr(input_text);

while(keep_intensively_processing) {
    ...
    Context context(…);
    // e.g. context.a = false; context.b = false; context.c = true;
    bool result(evaluate(expr, context));
    ...
}
I would really appreciate a minimal example or even just a small kernel that I can build upon that creates an expression from input text which is evaluated later in context.
I don't think this is exactly the same question as posted here: parsing boolean expressions with boost spirit
as I'm not convinced this is necessarily the quickest executing way of doing this, even though it looks very clever. In time I'll try to do a benchmark of all answers posted.
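For concreteness, a deliberately minimal hand-rolled version of the parse-once / evaluate-many shape (no Spirit or Proto; the parser itself is omitted, and the Context layout is just the illustrative one from the snippet above) could look like:
#include <functional>

// Illustrative context holding the placeholders from "a && b || c".
struct Context {
    bool a = false, b = false, c = false;
};

using BoolExpr = std::function<bool(const Context&)>;

// Combinators a parser could emit instead of (or from) a syntax tree.
BoolExpr var_a() { return [](const Context& ctx) { return ctx.a; }; }
BoolExpr var_b() { return [](const Context& ctx) { return ctx.b; }; }
BoolExpr var_c() { return [](const Context& ctx) { return ctx.c; }; }
BoolExpr and_(BoolExpr l, BoolExpr r) {
    return [l, r](const Context& ctx) { return l(ctx) && r(ctx); };
}
BoolExpr or_(BoolExpr l, BoolExpr r) {
    return [l, r](const Context& ctx) { return l(ctx) || r(ctx); };
}

// What parsing "a && b || c" once might produce; the hot loop then only pays for the calls:
//   BoolExpr expr = or_(and_(var_a(), var_b()), var_c());
//   Context context; context.c = true;
//   bool result = expr(context);   // true
Whether something like this is fast enough compared to a Proto-generated expression template is exactly what the benchmark mentioned above would have to show.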

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather than being able to cope with a source fragment.
The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help to clarify and rank requirements. This would work for C or C++...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC GCC has an option (-ffunction-sections) that compiles each function into an 'independent chunk' (its own section) within the generated object code file, and the linker can pull each 'chunk' into a program individually.
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false-positive changes.
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from a different change set) are the same instructions :-)
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining newlines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}' and balanced '(' ')'; all the other valid program text is just individual tokens which can be passed 'straight through'.
Its output might be a separate file per function to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then the parser is mostly just looking for the function name (an identifier) at file scope, followed by ( parameter-list ) { ... code ... }.
I'd SWAG it would be a few days' work using yacc & lex / flex & bison, and it might be so simple that there is no need for a parser generator (see the rough sketch below).
If the code is Java, then ANTLR is a possibility, and I think there was a simple Java parser example.
If Haskell is your focus, there may be published student projects which have made a reasonable stab at a parser.
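As a rough illustration of just how dumb the C/C++ scanner sketched above can be (no comment, string, or preprocessor handling; all names are made up), something like this records which lines each top-level brace-delimited body spans:
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Naive scan: records (name, first line, last line) for anything that looks like a
// top-level "identifier ( ... ) { ... }". Comments, strings and macros will fool it.
struct FunctionSpan { std::string name; int firstLine; int lastLine; };

std::vector<FunctionSpan> scanFunctions(const std::string& src) {
    std::vector<FunctionSpan> spans;
    int braceDepth = 0, parenDepth = 0, line = 1, startLine = 0;
    std::string lastIdent, current;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char ch = src[i];
        if (ch == '\n') { ++line; }
        else if (std::isalpha((unsigned char)ch) || ch == '_') {
            std::size_t j = i;
            while (j < src.size() && (std::isalnum((unsigned char)src[j]) || src[j] == '_')) ++j;
            if (braceDepth == 0 && parenDepth == 0)
                lastIdent = src.substr(i, j - i);        // candidate function name
            i = j - 1;
        }
        else if (ch == '(') { ++parenDepth; }
        else if (ch == ')') { if (parenDepth > 0) --parenDepth; }
        else if (ch == '{') {
            if (braceDepth == 0) { current = lastIdent; startLine = line; }
            ++braceDepth;
        }
        else if (ch == '}') {
            if (braceDepth > 0 && --braceDepth == 0 && !current.empty())
                spans.push_back({current, startLine, line});
        }
    }
    return spans;
}
Diffing the recorded spans between two revisions of a file would then give the per-function change counts the heatmap needs.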

How to store parsed function expressions for plugging-in many times?

As the topic indicates, my program needs to read several function expressions and plug in different variables many times. Parsing the whole expression again every time I need to plug in a new value is definitely way too ugly, so I need a way to store the parsed expression.
The expression may look like 2x + sin(tan(5x)) + x^2. Oh, and the very important point -- I'm using C++.
Currently I have three ideas, but none of them is very elegant:
1. Storing the expression as an S-expression tree and evaluating it by recursion. This may be the old-school way to handle it, but it's ugly, and I would have to deal with different numbers of parameters (like + vs. sin).
2. Composing anonymous functions with boost::lambda. It may work nicely, but personally I don't like Boost.
3. Writing a small Python/Lisp script, using its native lambda expressions, and calling it over IPC... Well, this is crazy.
So, any ideas?
UPDATE:
I did not try to implement support for parentheses and functions with only one parameter, like sin().
I tried the second way first, but I did not use boost::lambda; instead I used a feature of GCC which can be used to create (fake) anonymous functions, which I found here. The resulting code is 340 lines and does not work correctly because of scoping and a subtle stack issue.
Using lambda could not make it better, and I don't know whether it could handle scoping correctly. So, sorry for not testing boost::lambda.
Storing the parsed string as S-expressions would definitely work, but the implementation would be even longer -- maybe ~500 lines? My project is not the kind of gigantic project with tens of thousands of lines of code, so devoting so much energy to maintaining that kind of twisted code, which would not be used very often, does not seem like a good idea.
So finally I tried the third method -- it's awesome! The Python script has only 50 lines, pretty neat and easy to read. But, on the other hand, it would also make Python a prerequisite of my program. That's not so bad on *nix machines, but on Windows... I guess it would be very painful for non-programmers to install Python. The same goes for Lisp.
However, my final solution is to open bc as a subprocess. It may be a bad choice for most situations, but it fits me well.
On the other hand, for projects that work only under *nix or that already have Python as a prerequisite, I personally recommend the third way if the expression is simple enough to be parsed with a hand-written parser. If it's very complicated, then, as Hurkyl said, you could consider creating a mini-language.
Why not use a scripting language designed for exactly this kind of purpose? There are several such languages floating around, but my experience is with lua.
I use lua to do this kind of thing "all the time". The code to embed and parse an expression like that is very small. It would look something like this (untested):
#include <iostream>
#include <string>

extern "C" {                 // Lua is a C library
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>
}

int main()
{
    std::string my_expression = "2*x + math.sin( math.tan( x ) ) + x * x";

    // Initialise lua and load the standard libraries (math included).
    lua_State * L = luaL_newstate();
    luaL_openlibs(L);

    // Create your function and load it into lua (note the space before "end").
    std::string fn = "function myfunction(x) return " + my_expression + " end";
    luaL_dostring(L, fn.c_str());

    // Use your function
    for(int i = 0; i < 10; ++i)
    {
        // add the function to the stack
        lua_getglobal(L, "myfunction");
        // add the argument to the stack
        lua_pushnumber(L, i);
        // Make the call, using one argument and expecting one result.
        // stack looks like this : FN ARG
        lua_pcall(L, 1, 1, 0);
        // stack looks like this now : RESULT
        // so get the result and print it
        double result = lua_tonumber(L, -1);
        std::cout << i << " : " << result << std::endl;
        // The result is still on the stack, so clean it up.
        lua_pop(L, 1);
    }

    lua_close(L);
    return 0;
}

Complexity of parsing C++

Out of curiosity, I was wondering what some "theoretical" results about parsing C++ are.
Let n be the size of my project (in LOC, for example, but since we'll deal with big-O it's not very important)
Can C++ be parsed in O(n)? If not, what's the complexity?
Can C (or Java, or any language with a simpler grammar) be parsed in O(n)?
Will C++1x introduce new features that will be even harder to parse?
References would be greatly appreciated!
I think the term "parsing" is being interpreted by different people in different ways for the purposes of the question.
In a narrow technical sense, parsing is merely verifying that the source code matches the grammar (or perhaps even building a tree).
There's a rather widespread folk theorem that says you cannot parse C++ (in this sense) at all because you must resolve the meaning of certain symbols to parse. That folk theorem is simply wrong.
It arises from the use of "weak" (LALR or backtracking recursive descent) parsers which, if they commit to the wrong choice among several possible subparses of a locally ambiguous piece of text (this SO thread discusses an example), fail completely by virtue of sometimes making that choice. The way those who use such parsers resolve the dilemma is to collect symbol-table data as parsing occurs and mash extra checking into the parsing process to force the parser to make the right choice when such a choice is encountered. This works at the cost of significantly tangling name and type resolution with parsing, which makes building such parsers really hard. But, at least for legacy GCC, they used LALR, which is linear-time parsing, and I don't think it is much more expensive even if you include the name/type capture that the parser does (there's more to do after parsing, because I don't think they do it all).
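A textbook illustration of the kind of local ambiguity meant here (not necessarily the example from the linked thread):
// Without knowing whether T names a type, this statement has two legal parses:
T * b;   // either a declaration of b as pointer-to-T,
         // or an expression multiplying T by b (with the result discarded)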
There are at least two implementations of C++ parsers done using "pure" GLR parsing technology, which simply admits that the parse may be locally ambiguous and captures the multiple parses without comment or significant overhead. GLR parsers are linear time where there are no local ambiguities. They are more expensive in the ambiguity region, but as a practical matter, most of the source text in a standard C++ program falls into the "linear time" part. So the effective rate is linear, even capturing the ambiguities. Both of the implemented parsers do name and type resolution after parsing and use inconsistencies to eliminate the incorrect ambiguous parses.
(The two implementations are Elsa and our (SD's) C++ Front End. I can't speak for Elsa's current capability (I don't think it has been updated in years), but ours does all of C++11 [EDIT Jan 2015: now full C++14; EDIT Oct 2017: now full C++17], including GCC and Microsoft variants.)
If you take the hard computer-science definition that a language is extensionally defined as an arbitrary set of strings (grammars are supposed to be succinct ways to encode that intensionally) and treat parsing as "checking that the syntax of the program is correct", then for C++ you have to expand the templates to verify that each one actually expands completely. There's a Turing machine hiding in the templates, so in theory checking that a C++ program is valid is undecidable (given no limits on time or resources). Real compilers (honoring the standard) place fixed constraints on how much template unfolding they'll do, and so does real memory, so in practice C++ compilers finish. But they can take arbitrarily long to "parse" a program in this sense. And I think that's the answer most people care about.
As a practical matter, I'd guess most templates are actually pretty simple, so C++ compilers can finish as fast as other compilers on average. Only people crazy enough to write Turing machines in templates pay a serious price. (Opinion: the price is really the conceptual cost of shoehorning complicated things onto templates, not the compiler execution cost.)
It depends on what you mean by "parsed", but if your parsing is supposed to include template instantiation, then not in general:
[Shortcut if you want to avoid reading the example - templates provide a rich enough computational model that instantiating them is, in general, a halting-style problem]
template<int N>
struct Triangle {
    static const int value = Triangle<N-1>::value + N;
};

template<>
struct Triangle<0> {
    static const int value = 0;
};

int main() {
    return Triangle<127>::value;
}
Obviously, in this case the compiler could theoretically spot that triangle numbers have a simple generator function, and calculate the return value using that. But otherwise, instantiating Triangle<k> is going to take O(k) time, and clearly k can go up pretty quickly with the size of this program, as far as the limit of the int type.
[End of shortcut]
Now, in order to know whether Triangle<127>::value is an object or a type, the compiler in effect must instantiate Triangle<127> (again, maybe in this case it could take a shortcut since value is defined as an object in every template specialization, but not in general). Whether a symbol represents an object or a type is relevant to the grammar of C++, so I would probably argue that "parsing" C++ does require template instantiation. Other definitions might vary, though.
Actual implementations arbitrarily cap the depth of template instantiation, making a big-O analysis irrelevant, but I ignore that since in any case actual implementations have natural resource limits, also making big-O analysis irrelevant...
I expect you can produce similarly-difficult programs in C with recursive #include, although I'm not sure whether you intend to include the preprocessor as part of the parsing step.
Aside from that, C, in common with plenty of other languages, can have O(not very much) parsing. You may need symbol lookup and so on, which as David says in his answer cannot in general have a strict O(1) worst case bound, so O(n) might be a bit optimistic. Even an assembler might look up symbols for labels. But for example dynamic languages don't necessarily even need symbol lookup for parsing, since that might be done at runtime. If you pick a language where all the parser needs to do is establish which symbols are keywords, and do some kind of bracket-matching, then the Shunting Yard algorithm is O(n), so there's hope.
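Since the Shunting Yard algorithm came up: a minimal sketch of it (single-character tokens only; left-associative + - * / and parentheses) shows why it is O(n): each token is examined once, and every operator is pushed and popped at most once.
#include <cctype>
#include <map>
#include <stack>
#include <string>

// Minimal shunting-yard: converts infix over single-digit numbers, + - * / and
// parentheses into postfix (RPN).
std::string toPostfix(const std::string& infix) {
    static const std::map<char, int> prec = {{'+', 1}, {'-', 1}, {'*', 2}, {'/', 2}};
    std::string output;
    std::stack<char> ops;
    for (char ch : infix) {
        if (std::isdigit((unsigned char)ch)) {
            output += ch;                              // operands go straight to the output
        } else if (prec.count(ch)) {
            // pop operators of greater or equal precedence (left-associative)
            while (!ops.empty() && prec.count(ops.top()) && prec.at(ops.top()) >= prec.at(ch)) {
                output += ops.top();
                ops.pop();
            }
            ops.push(ch);
        } else if (ch == '(') {
            ops.push(ch);
        } else if (ch == ')') {
            while (!ops.empty() && ops.top() != '(') { output += ops.top(); ops.pop(); }
            if (!ops.empty()) ops.pop();               // discard the '('
        }                                              // anything else (e.g. spaces) is ignored
    }
    while (!ops.empty()) { output += ops.top(); ops.pop(); }
    return output;
}

// toPostfix("1*2+3") == "12*3+"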
It is hard to tell whether C++ can be "just parsed", as - contrary to most languages - it cannot be analysed syntactically without performing semantic analysis at the same time.