I want to create a computer language translator between two languages LANG1 and LANG2. More specifically, I want to translate code written in LANG1 into source code in LANG2.
I have the BNF Grammar for both LANG1 and LANG2.
LANG1 is a small DSL I wrote by myself, and is essentially, an "easier" version of LANG2.
I want to be able to generate statements in LANG2, from input statements written in LANG1.
I am in the process of building a compiler for LANG1, but I don't know what to do next (in order to translate LANG1 statements to LANG2 statements).
My understanding of the steps involved is as follows:
1. BNF for my DSL (LANG1) DONE
2. Reverse engineered the BNF for LANG2 DONE
3. Learning how to generate a compiler for LANG1 TODO
4. Translate LANG1 statements to LANG2 statements ???
What are the steps involved in order to generate LANG2 statements from LANG1 statements?
My code base is in C++, so I can use a parser generated in either C or C++.
PS: I will be using ANTLR3 to generate the compiler for LANG1
In the most general case you have to translate each possible construct of the LANG1 grammar into something appropriate for LANG2, or you have to dumb both languages down into the simplest possible shared primitives, like assembly or combinators. This is time consuming and not much fun.
However, if the grammars are equivalent or share a lot of common ground, you might be able to get away with parsing both into the same tree and writing output functions that can take your standardized tree and convert it back into LANG1 or LANG2 source (which is mostly the same as the general case but allows a lot more short cuts).
EDIT: Having just reread your question, I realized you only want to translate one way, so you need only make the form of the tree suit LANG1 and write the one output function for LANG2. But I hope my example is helpful anyway.
Example Tree construction:
Here are two different ANTLR grammars that produce the same extremely simple AST:
Grammar 1
The first one is the standard way of expressing addition:
grammar simpleAdd;
options {output=AST;}
tokens {
PLUS = '+';
}
expression : addition EOF!;
addition : NUMBER (PLUS NUMBER)+ -> ^(PLUS NUMBER*);
NUMBER :'0'..'9'+;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n')+ { $channel = HIDDEN; } ;
This will accept two or more integers and produce a tree with a PLUS node and all the numbers in a list to be added together. E.g.,
1 + 1 + 2 + 3 + 5
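Printed with ANTLR's toStringTree(), the resulting AST comes out in LISP-style notation, roughly (+ 1 1 2 3 5): a single PLUS root with every NUMBER as a direct child.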
Grammar 2
The second grammar takes a less elegant form:
grammar horribleAdd;
options {output=AST;}
tokens {
PLUS = '+';
PLUS_FUNC = 'plus';
COMMA = ',';
LEFT_PARENS = '(';
RIGHT_PARENS = ')';
}
expression : addition EOF!;
addition : PLUS_FUNC LEFT_PARENS NUMBER (COMMA NUMBER)+ RIGHT_PARENS -> ^(PLUS NUMBER*);
NUMBER :'0'..'9'+;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n')+ { $channel = HIDDEN; } ;
This grammar expects numbers given to a function (yes, I know functions don't really work like this, I'm just trying to keep the example as clear as possible). E.g.,
plus(1, 1, 2, 3, 5)
It produces exactly the same tree as the first grammar (a PLUS node with the numbers as children).
Now you don't need to worry what language your instructions came from, you can output it in whatever form you like. All you need to do is write a function that can convert this AST back into a language of your choosing.
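To make that last step concrete, here is a minimal C++ sketch of such output functions. The Node struct is a stand-in for whatever tree type your ANTLR runtime actually hands you, so treat these names as assumptions rather than the real API:

#include <string>
#include <vector>

// Stand-in for an ANTLR-style AST node: token text plus children.
struct Node {
    std::string text;            // "+" for the PLUS node, digits for a NUMBER
    std::vector<Node> children;  // empty for leaf nodes
};

// Convert a PLUS tree to the function-call style: plus(1, 1, 2)
std::string emitPlusFunc(const Node& n) {
    if (n.children.empty()) return n.text;   // a bare NUMBER leaf
    std::string out = "plus(";
    for (size_t i = 0; i < n.children.size(); ++i) {
        if (i > 0) out += ", ";
        out += emitPlusFunc(n.children[i]);
    }
    return out + ")";
}

// Convert the same tree to infix style: 1 + 1 + 2
std::string emitInfix(const Node& n) {
    if (n.children.empty()) return n.text;
    std::string out;
    for (size_t i = 0; i < n.children.size(); ++i) {
        if (i > 0) out += " + ";
        out += emitInfix(n.children[i]);
    }
    return out;
}

Fed the tree from either grammar, emitPlusFunc produces plus(1, 1, 2, 3, 5) and emitInfix produces 1 + 1 + 2 + 3 + 5; neither emitter knows or cares which source language the tree came from.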
Related
I'm trying to write an expression parser. One part I'm stuck on is breaking down an expression into blocks via its appropriate order of precedence.
I found the order of precedence for C++ operators here. But where exactly do I split the expression based on this?
I have to assume the worst of the user. Here's a really messy over-exaggerated test example:
if (test(s[4]) < 4 && b + 3 < r && a!=b && ((c | e) == (g | e)) ||
r % 7 < 4 * givemeanobj(a & c & e, b, hello(c)).method())
Perhaps it doesn't even evaluate, and if it doesn't I still need to break it down to determine that.
It should break down into blocks of singles and pairs connected by operators. Essentially it breaks down into a tree-structure where the branches are the groupings, and each node has two branches.
Following the order of precedence, the first thing to evaluate would be the givemeanobj() call, but that's an easy one to see. The next would be the multiplication sign. Does that split everything before the * into a separate block, or just the 4? 4 * givemeanobj(...) comes before the <, right? So that's the first grouping?
Is there a straightforward rule to follow for this?
Is there a straightforward rule to follow for this?
Yes, use a parser generator such as ANTLR. You write your language specification formally, and it will generate code which parses all valid expressions (and no invalid ones). ANTLR is nice in that it can give you an abstract syntax tree which you can easily traverse and evaluate.
Or, if the language you are parsing is actually C++, use Clang, which is a proper compiler and happens to be usable as a library as well.
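If you do want to see the rule a hand-written parser follows, the usual answer is "precedence climbing": each binary operator gets a numeric precedence level, and a recursive function only keeps consuming operators at or above its current minimum level. Here is a minimal sketch over integers and the four arithmetic operators (my own simplification, not tied to the question's code):

#include <cctype>
#include <cstdlib>

// A tiny precedence-climbing evaluator over input like "4 * 5 + 1".
// A real parser would build tree nodes instead of evaluating directly.
struct Parser {
    const char* p;

    int precedence(char op) {
        switch (op) {
            case '*': case '/': return 2;
            case '+': case '-': return 1;
            default:            return 0;   // not a binary operator
        }
    }

    int parsePrimary() {
        while (std::isspace((unsigned char)*p)) ++p;
        if (*p == '(') {                     // parenthesized sub-expression
            ++p;
            int v = parseExpr(1);
            while (std::isspace((unsigned char)*p)) ++p;
            ++p;                             // skip ')'
            return v;
        }
        return (int)std::strtol(p, const_cast<char**>(&p), 10);
    }

    int parseExpr(int minPrec) {
        int lhs = parsePrimary();
        for (;;) {
            while (std::isspace((unsigned char)*p)) ++p;
            char op = *p;
            int prec = precedence(op);
            if (prec < minPrec) return lhs;  // operator binds too loosely: stop here
            ++p;
            // The recursive call only accepts strictly higher precedence,
            // which makes every operator left-associative.
            int rhs = parseExpr(prec + 1);
            switch (op) {
                case '+': lhs += rhs; break;
                case '-': lhs -= rhs; break;
                case '*': lhs *= rhs; break;
                case '/': lhs /= rhs; break;
            }
        }
    }
};

Parser ps{"4 * 5 + 1"}; ps.parseExpr(1) gives 21: when the loop holds lhs = 4 and sees *, it recurses at a higher minimum level, so 5 binds to the * before + gets its turn. That is exactly the "split" the question asks about: * groups 4 with the operand to its right before < is ever considered.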
I'm making a program to evaluate propositions in conditional logic (~, or, and, ->, <->). The user inputs propositional variables with truth values (true, false) and a proposition; the program then goes through the input and returns the truth value of the whole proposition.
For example: if I set p = true, q = true, r = false and input: p or q and r.
Is there any way I can cut out q and r first, process them, and put the result back (which is false), then process the next bit (p or false)? It would have to keep cutting out bits (in the proper order of precedence) and putting them back until all that's left is a single true or false.
And what am I supposed to use to hold the user input (array, string)?
Any help would be appreciated! Thanks.
Tasks like this are usually split into two phases, lexical analysis and syntactic analysis.
Lexical analysis splits the input into a stream of tokens. In your case the tokens would be the operators ~, or, and, ->, <->, variables, and the values true and false. You didn't mention them, but I imagine you will also want brackets as tokens in your language. Your language is simple enough that you could write the lexical analyser yourself, but tools such as flex or ragel might help you.
Syntactic analysis is where you tease out the syntactic structure of your input and perform whatever actions you need (evaluating the proposition, in your case). Syntactic analysis is more complex than lexical analysis. You could write a recursive descent parser for this task, or you could use a parser generator to write the code for you. The traditional tool for this is called bison, but it's a bit clunky. I like another simple tool called the lemon parser generator, although it's more C-oriented than C++.
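To make the two phases concrete for this particular language, here is a sketch of a hand-written lexer plus recursive-descent evaluator in C++. The precedence order (~ binds tightest, then and, or, ->, <->) and the right-associativity of -> are my assumptions; adjust them to your course's definitions:

#include <cctype>
#include <map>
#include <string>
#include <vector>

// Phase 1: lexical analysis. Words ("p", "or", "true") and the symbols
// ~, ->, <->, (, ) each become one token.
std::vector<std::string> lex(const std::string& s) {
    std::vector<std::string> toks;
    size_t i = 0;
    while (i < s.size()) {
        if (std::isspace((unsigned char)s[i])) { ++i; }
        else if (std::isalpha((unsigned char)s[i])) {
            size_t j = i;
            while (j < s.size() && std::isalpha((unsigned char)s[j])) ++j;
            toks.push_back(s.substr(i, j - i));
            i = j;
        }
        else if (s.compare(i, 3, "<->") == 0) { toks.push_back("<->"); i += 3; }
        else if (s.compare(i, 2, "->") == 0)  { toks.push_back("->");  i += 2; }
        else { toks.push_back(s.substr(i, 1)); ++i; }
    }
    return toks;
}

// Phase 2: syntactic analysis by recursive descent, one function per
// precedence level, lowest level first.
struct Eval {
    std::vector<std::string> t;
    std::map<std::string, bool> vars;
    size_t pos = 0;

    bool eat(const std::string& s) {
        if (pos < t.size() && t[pos] == s) { ++pos; return true; }
        return false;
    }
    bool iff()  { bool v = imp();  while (eat("<->")) { bool r = imp();   v = (v == r); } return v; }
    bool imp()  { bool v = orE();  if (eat("->"))     { bool r = imp();   v = !v || r;  } return v; } // right-assoc
    bool orE()  { bool v = andE(); while (eat("or"))  { bool r = andE();  v = v || r;   } return v; }
    bool andE() { bool v = unary(); while (eat("and")) { bool r = unary(); v = v && r;  } return v; }
    bool unary() {
        if (eat("~")) return !unary();
        if (eat("(")) { bool v = iff(); eat(")"); return v; }
        if (eat("true"))  return true;
        if (eat("false")) return false;
        return vars[t[pos++]];   // a propositional variable
    }
};

Usage on the question's example:

Eval e;
e.t = lex("p or q and r");
e.vars = {{"p", true}, {"q", true}, {"r", false}};
bool value = e.iff();   // true: "and" binds tighter, so this reads p or (q and r)

This also answers the storage question: keep the raw input in a std::string and the token stream in a std::vector<std::string>.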
I know I need to show that there is no string I can derive, using only leftmost derivations, that leads to two different parse trees. But how can I do that? I know there is no simple general way of doing it, but since this exercise is in the compilers Dragon book, I am pretty sure there is a way of showing it (no need for a formal proof, just a justification of why).
The grammar is:
S → SS* | SS+ | a
What this grammar represents is another way of writing simple arithmetic (I do not remember the name of this technique; if anyone knows, please tell me): normal sum arithmetic has the form a+a, and this just represents another way of summing and multiplying. So aa+ also means a+a, aaa*+ is a*a+a, and so on.
The easiest way to prove that a CFG is unambiguous is to construct an unambiguous parser. If the grammar is LR(k) or LL(k) and you know the value of k, then that is straightforward.
This particular grammar is LR(0), so the parser construction is almost trivial; you should be able to do it on a single sheet of paper (which is worth doing before you try to look up the answer).
The intuition is simple: every production ends with a different terminal symbol, and those terminal symbols appear nowhere else in the grammar. So when you read a terminal, you know precisely which production to reduce by; only one can apply, and no shift can conflict with it.
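To see the determinism concretely, here is a worked shift-reduce trace for the sentence aaa*+ (my own illustration, not from the book):

stack        input    action
             aaa*+    shift a
a            aa*+     reduce S → a
S            aa*+     shift a
S a          a*+      reduce S → a
S S          a*+      shift a
S S a        *+       reduce S → a
S S S        *+       shift *
S S S *      +        reduce S → SS*
S S          +        shift +
S S +                 reduce S → SS+
S                     accept

At every step the next input symbol alone determines the action.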
If you invert the grammar to produce Polish (or Łukasiewicz) notation, then you get a trivial LL grammar. Again the parsing algorithm is obvious, since every right hand side starts with a unique terminal, so there is only one prediction which can be made:
S → * S S | + S S | a
So that's also unambiguous. But the infix grammar is ambiguous:
S → S * S | S + S | a
The easiest way to prove ambiguity is to find a sentence which has two parses; one such sentence in this case is:
a + a + a
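Concretely, a + a + a parses either as (a + a) + a (the left S of S + S expands again) or as a + (a + a) (the right S expands again): one sentence, two distinct parse trees.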
I think the example string aa actually shows what you need. Can it not be parsed as:
S => SS* => aa OR S => SS+ => aa
Basically, the language has 3 list types and 3 fixed-length types, one of which is string.
Detecting the type of a token with regular expressions is simple, but splitting the input into tokens is not that trivial.
Strings are written with double-quotes, and a double-quote is escaped with a backslash.
EDIT:
Some example code
{
print (sum (1 2 3 4))
if [( 2 + 3 ) < 6] : {print ("Smaller")}
}
The lists work like this:
() are argument lists that are only evaluated when necessary.
[] are special lists to express 2-operand operations in a prettier way.
{} are lists that are always evaluated. The first element is a function name, the second is a list of arguments, and this repeats.
anything : anything [ : anything [ : ...]] translates to an argument list whose elements are joined by the :s. This is only for making loops and conditionals look better.
All functions take a single argument. Argument lists can be used for functions that need more. You can force an argument list to evaluate using different types of eval functions. (There would be an eval function for each list model.)
So, if you understand this, it works very much like Lisp does; it only has different list types for prettifying the code.
EDIT:
#rici
[[2 + 3] < 6] is OK too. As I mentioned, argument lists are evaluated only when necessary. Since < is a function that requires an argument list of length 2, (2 + 3) must be evaluated somehow; otherwise [(2 + 3) < 6] would translate to < (2 + 3) : 6, which equals < (2 + 3 6), which is an invalid argument list for <.
But I see your point: it's not trivial how automatic parsing should work in this case. In the version I described above, [...] evaluates its argument list with a function like eval_as_oplist (...). But I guess you are right, because this way you couldn't use an argument list in the regular way inside a [...], which is problematic even if you have no reason to do so, because it doesn't lead to better code. So [[. . .] . .] is better code, I agree.
Rather than inventing your own "Lisp-like, but simpler" language, you should consider using an existing Lisp (or Scheme) implementation and embedding it in your C++ application.
Although designing your own language and then writing your own parser and interpreter for it is surely good fun, you will have a hard time coming up with something better designed, more powerful, and implemented more efficiently and robustly than, say, Scheme and its numerous implementations.
Chibi Scheme: http://code.google.com/p/chibi-scheme/ is particularly well suited for embedding in C/C++ code; it's very small and fast.
I would suggest using Flex (possibly with Bison) or ANTLR, which has a C++ output target.
Since google is simpler than finding stuff on my own file server, here is someone else's example:
http://ragnermagalhaes.blogspot.com/2007/08/bison-lisp-grammar.html
This example has formatting problems (which can be resolved by viewing the HTML in a text editor) and only supports one type of list, but it should help you get started and certainly shows how to split the items into tokens.
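For comparison, hand-rolling the tokenizer is also only a screenful of C++. This sketch guesses the token categories from the question (the three bracket pairs, :, double-quoted strings with \" and \\ escapes, and bare atoms), so adjust it to the real lexical rules:

#include <cctype>
#include <string>
#include <vector>

// Split source into tokens: any of (){}[]: as single-char tokens,
// double-quoted strings (with \" and \\ escapes) as one token,
// and runs of anything else as atoms.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> toks;
    size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (std::isspace((unsigned char)c)) { ++i; continue; }
        if (std::string("(){}[]:").find(c) != std::string::npos) {
            toks.push_back(std::string(1, c));   // a bracket or colon
            ++i;
        } else if (c == '"') {                   // string literal
            size_t j = i + 1;
            while (j < src.size() && src[j] != '"') {
                if (src[j] == '\\') ++j;         // skip the escaped character
                ++j;
            }
            toks.push_back(src.substr(i, j + 1 - i));
            i = j + 1;
        } else {                                 // atom: number, name, operator
            size_t j = i;
            while (j < src.size() && !std::isspace((unsigned char)src[j]) &&
                   std::string("(){}[]:\"").find(src[j]) == std::string::npos)
                ++j;
            toks.push_back(src.substr(i, j - i));
            i = j;
        }
    }
    return toks;
}

On the example above, print ("Smaller") comes out as the four tokens print, (, "Smaller", and ).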
I believe Boost.Spirit would be suitable for this task provided you could construct a PEG-compatible grammar for the language you're proposing. It's not obvious from the examples as to whether or not this is the case.
More specifically, Spirit has a generalized AST called utree, and there is example code for parsing symbolic expressions (i.e. Lisp syntax) into utree.
You don't have to use utree in order to take advantage of Spirit's parsing and lexing capabilities, but you would have to have your own AST representation. Maybe that's what you want?
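For a feel of what that looks like, here is a minimal qi-based recognizer for s-expression-like input. The grammar is my guess at the question's syntax, and it only recognizes the text; binding it to utree or your own AST is the next step:

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

namespace qi = boost::spirit::qi;

int main() {
    using Iter = std::string::const_iterator;
    qi::rule<Iter, qi::space_type> expr, list;

    // An expression is a number, a bare atom, or any of the three list forms.
    expr = qi::double_
         | qi::lexeme[+(qi::char_ - qi::char_("()[]{}") - qi::space)]
         | list;
    list = ('(' >> *expr >> ')')
         | ('[' >> *expr >> ']')
         | ('{' >> *expr >> '}');

    std::string input = "{ print (sum (1 2 3 4)) }";
    Iter first = input.cbegin(), last = input.cend();
    bool ok = qi::phrase_parse(first, last, expr, qi::space) && first == last;
    std::cout << (ok ? "recognized" : "syntax error") << "\n";
}

This accepts the question's { print (sum (1 2 3 4)) } example; attaching attributes (utree or your own types) to the rules is where the real work starts.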
In my program, written in C++, I need to take a set of strings, each containing the declaration of a C function, and perform a number of operations on them.
One of the operations is to compare whether one function is equal to another. To do that I plan to just prune away comments and intermediate whitespace which has no effect on the semantics of the function and then do a string comparison. However, I would like to retain whitespace within a string as removing that would change the output produced by the function.
I could write some code which iterates over the string's characters and enters "string mode" whenever a quote (") is encountered, recognizing escaped quotes, but I wonder if there is any better way of doing this. One idea is to use a full-fledged C parser, run it over the function string, ignore all comments and excessive whitespace, and then convert the AST back to a string again. But looking around at some C parsers, I get the feeling that most are a bitch to integrate with my source code (prove me wrong if I am). Perhaps I could try yacc with an existing C grammar and implement the parser myself...
So, any ideas on the best way to do this?
EDIT:
The program I'm writing takes an abstract model and converts it into C code. The model consists of a graph, where the nodes may or may not contain segments of C code (more precisely, a C function definition whose execution must be completely deterministic (i.e., no global state) and in which no memory operations are allowed). The program does pattern matching on the graph and merges and splits certain nodes which adhere to these patterns. However, these operations can only be performed if the nodes exhibit the same functionality (i.e., if their C function definitions are the same). This "checking that they are the same" will be done by simply comparing the strings which contain the C function declarations. If they are character-by-character identical, then they are equal.
Due to the nature of how the models are generated, this is quite a reasonable method of comparison, provided that the comments and excess whitespace are removed, as these are the only factors that may differ. This is the problem I'm facing: how do I do this with a minimal amount of implementation effort?
What do you mean by comparing whether one function is equal to another? With a suitably precise meaning, that problem is known to be undecidable!
You did not tell us what your program is really doing. Parsing all real C programs correctly is not trivial (because the C language's syntax and semantics are not that simple!).
Did you consider using existing tools or libraries to help you? LLVM Clang is a possibility, as is extending GCC through plugins, or even better with extensions coded in MELT.
But we cannot help you more without understanding your real goal. And parsing C code is probably more complex than what you imagine.
It looks like you can get away with a simple island grammar that strips comments, keeps string literals intact, and collapses whitespace (tabs, '\n'). Since I'm working with AXE†, I wrote a quick grammar‡ for you. You can write a similar set of rules using Boost.Spirit.
#include <axe.h>
#include <string>

template<class I>
std::string clean_text(I i1, I i2)
{
    // rules for non-recursive comments, and no line continuation
    auto endl = axe::r_lit('\n');
    auto c_comment = "/*" & axe::r_find(axe::r_lit("*/"));
    auto cpp_comment = "//" & axe::r_find(endl);
    auto comment = c_comment | cpp_comment;

    // rules for string literals
    auto esc_backslash = axe::r_lit("\\\\");
    auto esc_quote = axe::r_lit("\\\"");
    auto string_literal = '"' & *(*(axe::r_any() - esc_backslash - esc_quote)
        & *(esc_backslash | esc_quote)) & '"';

    auto space = axe::r_any(" \t\n");
    auto dont_care = *(axe::r_any() - comment - string_literal - space);

    std::string result;

    // semantic actions
    // append everything matched
    auto append_all = axe::e_ref([&](I i1, I i2) { if(i1 != i2) result += std::string(i1, i2); });
    // append a single space
    auto append_space = axe::e_ref([&](I i1, I i2) { if(i1 != i2) result += ' '; });

    // island grammar for text
    auto text = *(dont_care >> append_all
        & *comment
        & *string_literal >> append_all
        & *(space % comment) >> append_space)
        & axe::r_end();

    if(text(i1, i2).matched)
        return result;
    else
        throw "error";
}
So now you can do the text cleaning:
std::string text; // this is your function
text = clean_text(text.begin(), text.end());
You might also need to create rules for superfluous ';', empty blocks {}, and the like. You might also need to merge adjacent string literals. How far you need to go depends on the way the functions were generated; you may end up writing a sizable portion of a C grammar.
† The AXE library is soon to be released under the Boost license.
‡ I didn't test the code.
Perhaps your C functions that you want to parse are not as general (in their textual form, and also as parsed by a real compiler) as we are guessing.
You might consider doing things the other way round:
It could make sense to define a small domain-specific language (it could have a syntax much simpler to parse than C) and, instead of parsing C code, to do the reverse: the user would write in your DSL, and your tool would generate C code from it (to be compiled at a later stage by your usual C compiler).
Your DSL could actually be the description of your abstract model, mixed with more procedural parts which are translated into C functions. Since the C functions you care about are quite specific, the DSL generating them could be small.
(Consider that many parser generators like ANTLR or YACC or Bison are built on a similar idea.)
I actually did something quite similar in MELT (read notably my DSL2011 paper). You might find some useful tricks about designing a DSL translated to C.