AST for any arbitrary programming language or IR - C++

Is it possible to create an AST for any arbitrary programming language or IR using C or C++ alone (without the help of tools like YACC and LEX)?
If so, how does one implement the lexical and syntactic analysis?
If not, what tools have to be used alongside C or C++ to successfully create an AST?
I hope I made my doubt clear. If my question looks vague or out of context, please indicate what is missing.
P.S.: I am actually trying to create an AST for LLVM's .ll IR format. I do know that .ll is itself derived from an AST, but I am trying out static analysis practices, so I am looking at creating the AST myself.

The most straightforward methodology for creating a parser without a parser generator is recursive descent. It is very well documented; the standard book in the field is the Dragon Book.
A scanner, which takes text as input and produces a stream of tokens as output, can be written using standard string manipulation techniques.
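For illustration, here is a minimal sketch of such a scanner in C++. It is not a complete lexer for LLVM's .ll format (whose real token rules are richer); the token classes below are simplified assumptions, just enough to show the string-manipulation idea:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// A token is just its text here; a real scanner would also record a kind
// (identifier, number, punctuation) and a source location.
std::vector<std::string> scan(const std::string& src) {
    std::vector<std::string> tokens;
    size_t i = 0;
    while (i < src.size()) {
        unsigned char c = src[i];
        if (std::isspace(c)) { ++i; continue; }
        size_t start = i;
        if (std::isalpha(c) || c == '_' || c == '%' || c == '@') {
            // identifiers, keywords, and LLVM-style %local / @global names
            ++i;
            while (i < src.size() && (std::isalnum((unsigned char)src[i]) || src[i] == '_' || src[i] == '.'))
                ++i;
        } else if (std::isdigit(c)) {
            while (i < src.size() && std::isdigit((unsigned char)src[i])) ++i;
        } else {
            ++i;  // single-character punctuation: = , ( ) ...
        }
        tokens.push_back(src.substr(start, i - start));
    }
    return tokens;
}

int main() {
    for (const auto& t : scan("%sum = add i32 %a, %b"))
        std::cout << '[' << t << "] ";
    std::cout << '\n';
}

The recursive descent parser then consumes this token stream, with one function per grammar rule, each building and returning its piece of the AST.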

I doubt there's a one-to-one mapping between your arbitrary language and LLVM's ASTs.
That means it is likely that you really want to do this in two stages:
Parse your 'arbitrary language' using the best parsing tools you can get to simplify the problem of parsing your language. Use that to build an AST for your language, following standard methods for parser generators producing ASTs. LEX/YACC are OK, but there are plenty of good alternatives out there. It's pretty likely you'll need to build a symbol table.
Walk the AST of your parsed language to build your LLVM AST. This won't be one-to-one, but the ability to look around the tree near a node in your AST to collect the information needed to generate the LLVM code will likely be extremely helpful.
This is a classic style for a simple compiler.
I suggest you read the Aho/Ullman Dragon book on syntax directed translation. A day's worth of education will save you months of wasted engineering time.

Related

From proprietary format to C++ classes

Given an input like this (USER DEFINED FORMAT):
type dog<
int years
char[] name
>
How can I generate two or more different files like these:
file1.c
------------
struct dog{
int years
char name
}
file2.cpp
-------------
class dog{
int years
string name
%get and set methods
}
Is a parser generator like flex and bison the best way? Or are there better ways?
It all depends on the complexity of your language and the transformations you intend.
Option 1: hand-crafted ad-hoc processing
If your transformations are simple and mostly convert some tokens from your source language into the target language (e.g. "type" into "struct" and "<>" into "{}"), you could consider hand-crafted processing using string operations or regexes via std::regex.
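As a sketch of what such purely textual processing looks like, here is a token-substitution pass with std::regex over the example input (the substitutions are assumptions chosen for illustration):

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string src = "type dog<\nint years\nchar[] name\n>";
    // Purely textual substitutions: "type" -> "struct", "<" -> "{", ">" -> "}".
    src = std::regex_replace(src, std::regex("\\btype\\b"), "struct");
    src = std::regex_replace(src, std::regex("<"), "{");
    src = std::regex_replace(src, std::regex(">"), "}");
    std::cout << src << '\n';
}

Note how the output still lacks semicolons and leaves the [] attached to the type rather than the name, which is exactly the insufficiency described next.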
But this is not sufficient here: you must identify statements in your source language in order to generate semicolon-terminated C and C++ statements. Furthermore, you must relate the [] to the right element, and to generate correct getters/setters your code must understand the content of each statement.
So yes, here you'll need a lexer and a parser.
Option 2: hand-crafted parser
For "simple" languages, you could craft easily a recursive descent parser. What does "simple" mean ? It means that the language which can be described with a LL(1) grammar. Without entering now too much in compiler theory, it's more or less languages where one token ahead uniquely determine which grammar construct you have.
Note that you have a very simple calculator example designed this way in the "C++ Programming language" from Bjarne Stroustrup.
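To make this concrete, here is a minimal sketch of a recursive descent parser for the example format above. It assumes whitespace-separated tokens (so the input is written dog < rather than dog<); a real implementation would use a scanner that splits punctuation, and the generated C is simplified:

#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One parsed field of the user-defined format, e.g. "char[] name".
struct Field { std::string type; bool isArray; std::string name; };
struct TypeDef { std::string name; std::vector<Field> fields; };

// A minimal LL(1) recursive descent parser for:
//   typedef := "type" IDENT "<" field* ">"
//   field   := IDENT ("[]")? IDENT
class Parser {
    std::istringstream in;
    std::string tok;
    void next() { if (!(in >> tok)) tok.clear(); }  // whitespace-separated tokens
    void expect(const std::string& s) {
        if (tok != s) throw std::runtime_error("expected '" + s + "', got '" + tok + "'");
        next();
    }
public:
    explicit Parser(const std::string& src) : in(src) { next(); }
    TypeDef parseTypeDef() {
        TypeDef td;
        expect("type");
        td.name = tok; next();
        expect("<");
        while (tok != ">") {
            Field f;
            // accept a "[]" suffix fused to the type name, e.g. "char[]"
            f.isArray = tok.size() > 2 && tok.substr(tok.size() - 2) == "[]";
            f.type = f.isArray ? tok.substr(0, tok.size() - 2) : tok;
            next();
            f.name = tok; next();
            td.fields.push_back(f);
        }
        expect(">");
        return td;
    }
};

int main() {
    Parser p("type dog < int years char[] name >");
    TypeDef td = p.parseTypeDef();
    std::cout << "struct " << td.name << "{\n";
    for (const auto& f : td.fields)
        std::cout << "  " << f.type << " " << f.name << (f.isArray ? "[]" : "") << ";\n";
    std::cout << "};\n";
}

The single token of lookahead (tok) is what makes this LL(1): at every point, looking at the current token alone tells the parser which rule applies.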
Option 3: use a compiler generator
While lexical scanners and recursive descent parsers are relatively easy to code, it's still somewhat reinventing the wheel. And for languages which are not LL(1) compliant, you have no other choice than to use a compiler generator.
Flex/bison is then a very reasonable and widely supported choice. But you have to become familiar with its logic, grammar files, and code templates. It's not something that you can master in a couple of hours.
If you have to support a custom language professionally, it's really worth the investment. If it's a small project, the learning curve could play against you.
Option 4: lightweight boost alternative
Another, lighter approach could be to use Boost.Spirit, which builds on plain C++ and is easier to learn: you express the grammar rules directly in C++. There are two main components: Lex for lexical scanning and Qi for parsers.
It's OK for small to medium parsers (like a data definition language), but it has some serious limitations if you have a full language. And, as far as I know, it's limited to LL(k) grammars.
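To give a feel for the Qi side, here is the classic introductory example of parsing a comma-separated list of numbers, close to the one in the Spirit documentation (it assumes Boost is installed; Spirit itself is header-only):

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

int main() {
    std::string input = "1.5, 2.0, 3.25";
    std::vector<double> values;
    auto first = input.begin();
    // Grammar expressed directly in C++: doubles separated by commas,
    // with whitespace skipped between tokens.
    bool ok = qi::phrase_parse(first, input.end(),
                               qi::double_ % ',',
                               qi::space, values);
    if (ok && first == input.end())
        for (double v : values) std::cout << v << '\n';
}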

Can an LL(*) parser (like ANTLR3) parse C++?

I need to create a parser for C++14. This parser must be written in C++ in order to enable reuse of legacy code generation. I am thinking of implementing this using ANTLR3 (because ANTLR4 doesn't target C++ code yet).
My doubt is whether ANTLR3 can parse C++, since it doesn't use the Adaptive LL(*) algorithm like ANTLR4 does.
Most classic parser generators cannot generate a parser for an arbitrary context-free language. The restrictions on the grammars they can handle often give rise to the name of the class of parser generators: LL(k), LALR, ... ANTLR3 is essentially LL; ANTLR4 is better, but still not fully context-free.
Earley, GLR, and GLL parser generators can parse context-free languages, sometimes with high costs. In practice, Earley tends to be pretty slow (but see the MARPA parser generator used with Perl 6, which I understand to be an Earley variant that is claimed to be reasonably fast). GLR and GLL seem to produce working parsers with reasonable performance.
My company has built about 40 parsers for real languages using GLR, including all of C++14, so I have a lot of confidence in the utility of GLR.
When it comes to parsing C++, you're in a whole other world, mostly because C++ parsing seems to depend on collecting symbol table information at the same time. (It isn't really necessary to do that if you can parse context-free).
You can probably make ANTLR4 (and even ANTLR3) parse C++ if you are willing to fight it hard enough. Essentially what you do is build a parser which accepts too much [often due to limitations of the parser generator class], and then uses ad hoc methods to strip away the extra. This is essentially what the hand-written GCC and Clang parsers do; the symbol table information is used to force the parser down the right path.
If you choose to go down this path of building your own parser, no matter which parser generator you choose, you will invest huge amounts of energy to get a working parser. [Been here; done this.] This isn't a good way to get on with whatever task motivated building this parser in the first place.
I suggest you get one that already works. (I've already listed two; you can find out about our parser through my bio if you want).
That will presumably leave you with a working parser. Then you want to do something with the parse tree, and you'll discover that Life After Parsing requires a lot of machinery that the parsers don't provide. Google the phrase to find my essay on the topic or check my bio.

Unknown type of macro in C++

I noticed something when looking at the input language of a piece of software (OpenFOAM) which is written in C++. It was something like:
field value;
For example,
temperature 25;
I wonder how this works. I mean, how is temperature set to 25 without using an equals sign? Any idea?
Because openFOAM has a parser that understands that format.
Such a parser is not hard to make, and C++ (with the help of a parser-generator tool like yacc or bison) is a popular choice for parsers because it can be very fast.
C++ is a Turing-complete language, therefore it can do anything that any other language can. Specifically, it can process data that doesn't look like C++ code.
Programming languages are typically context-free. If the language is context-free, it can be parsed (I won't say it's trivial, but it's a problem that has been solved many times before, and if you take a compilers class you'll have to do it for simple languages).
The parser is simply looking for declarations of the form field value;. It appears there is no type checking happening, so it is simply splitting on the semicolon and then on the space. Read about parsing context-free languages, pushdown automata, and context-free grammars if you're interested in learning about parsing source code.
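A minimal sketch of that "split on semicolon, then on whitespace" idea in C++ (the function name and inputs are invented for illustration; OpenFOAM's real dictionary format is considerably richer, with nested sub-dictionaries and lists):

#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Parse "field value;" entries into a dictionary: split on ';' to get
// entries, then on whitespace to separate field from value.
std::map<std::string, std::string> parseDict(const std::string& src) {
    std::map<std::string, std::string> dict;
    std::istringstream stream(src);
    std::string entry;
    while (std::getline(stream, entry, ';')) {
        std::istringstream fields(entry);
        std::string key, value;
        if (fields >> key >> value)
            dict[key] = value;
    }
    return dict;
}

int main() {
    auto dict = parseDict("temperature 25; pressure 101325;");
    std::cout << "temperature = " << dict["temperature"] << '\n';
}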

Parse tree for SQL statements - precisely for "SELECT" statement

I am writing a hand-written recursive descent parser for the SQL SELECT statement in C++, and I need to know whether the parse tree I create is correct. I want to check it, but I haven't found good sources for SQL parse trees. My approach is to write a function for each production, and in that function the result is added to the root of the tree. Can anyone help me? Thanks in advance.
I don't know how you'll go about verifying your code is correct, but if you're concerned about your understanding of the SQL grammar, then here is a website that lists BNF grammars for various dialects of SQL. You ought to be able to construct your parser in terms of these rules.
My company builds a lot of parsers, and have your same problem. We recently finished a SQL 2011 parser based on the draft standard.
Pretty much you decide if the parse tree is right by hand-inspecting it for many source code cases. This presumes that you can print the parse tree in a form that you can easily inspect; that is easily accomplished by a recursive walk of the parse tree. [You have to already believe that your abstract syntax tree nodes correctly model what you intend to capture!] You choose the cases carefully to exercise different parts of the grammar (think "unit tests for grammars"). For a language as rich as SQL, this is a big job.
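A minimal sketch of such a recursive tree walk in C++, using a generic labeled node (the node design here is an assumption for illustration; a real SQL parse tree will have richer node types):

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// A generic parse-tree node: a label plus children.
struct Node {
    std::string label;
    std::vector<std::unique_ptr<Node>> children;
    explicit Node(std::string l) : label(std::move(l)) {}
};

// Recursive walk that prints the tree with indentation for hand inspection.
void print(const Node& n, int depth = 0) {
    std::cout << std::string(depth * 2, ' ') << n.label << '\n';
    for (const auto& c : n.children)
        print(*c, depth + 1);
}

int main() {
    // SELECT name FROM users -- built by hand here; your parser would build it.
    Node select("select_stmt");
    auto cols = std::make_unique<Node>("columns");
    cols->children.push_back(std::make_unique<Node>("name"));
    auto from = std::make_unique<Node>("from");
    from->children.push_back(std::make_unique<Node>("users"));
    select.children.push_back(std::move(cols));
    select.children.push_back(std::move(from));
    print(select);
}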
You also need to validate that the parser works in general, and you do that by feeding it a lot of real code for the particular dialect of SQL you are handling. I typically try to find 100K-1M SLOC, and if the parser can't eat all of that, I still have work left to do. Once you get to that level, you sort of consider that your parser is OK and treat further errors as "maintenance issues".
While the following may not help you directly, it might hint at a direction in which you could head. I use a somewhat different approach, based on having extremely strong parsing machinery available. Our tool, the DMS Software Reengineering Toolkit, given a grammar, will produce ASTs automatically, and has built-in facilities to print such parse trees (in one form as XML). The AST has sufficient information to regenerate ("prettyprint") the source text, and DMS has a built-in prettyprinter. So after hand inspecting a variety of cases, what I do is to take a large body of code, and for each file, parse it (getting no parse errors by virtue of the work done above), prettyprint the source, and reparse the source (expecting to get no errors). This is strong hint that we haven't lost anything in the round trip.
We have a new tool available, the Smart Differencer, that compares the text of two programs to see if they are "the same" ignoring language layout rules. It works in essence by parsing the two files and comparing their parse trees, ignoring the formatting (line/column/escapes/radix/comments/whitespace). What we are starting to do is to parse the source code, prettyprint it, and then smart-diff the prettyprinted result against the original file. SmartDiff should say "no AST differences". This is a much stronger hint that we haven't lost anything. You can do pretty much the same if you are willing to compare your before-and-after printed parse trees.
This parser, based on pyparsing, might be helpful as a second SELECT parsing resource (although it is in Python, not C++, sorry).

Compiler-Programming: What are the most fundamental ingredients?

I am interested in writing a very minimalistic compiler.
I want to write a small piece of software (in C/C++) that fulfills the following criteria:
output in ELF format (*nix)
input is a single textfile
C-like grammar and syntax
no linker
no preprocessor
very small (max. 1-2 KLOC)
Language features:
native data types: char, int and floats
arrays (for all native data types)
variables
control structures (if-else)
functions
loops (would be nice)
simple algebra (div, add, sub, mul, boolean expressions, bit-shift, etc.)
inline asm (for system calls)
Can anybody tell me how to start? I don't know what parts a compiler consists of (at least not in the sense that I could just start right off) and how to program them. Thank you for your ideas.
With all that you hope to accomplish, the most challenging requirement might be "very small (max. 1-2 KLOC)". I think your first requirement alone (generating ELF output) might take well over a thousand lines of code by itself.
One way to simplify the problem, at least to start with, is to generate code in assembly language text that you then feed into an existing assembler (nasm would be a good choice). The assembler would take care of generating the actual machine code, as well as all the ELF specific code required to build an actual runnable executable. Then your job is reduced to language parsing and assembly code generation. When your project matures to the point where you want to remove the dependency on an assembler, you can rewrite this part yourself and plug it in at any time.
If I were you, I might start with an assembler and build pieces on top of it. The simplest "compiler" might take a language with just a few very simple possible statements:
print "hello"
a = 5
print a
and translate that to assembly language. Once you get that working, then you can build a lexer and parser and abstract syntax tree and code generator, which are most of the parts you'll need for a modern block structured language.
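To make that concrete, here is a rough sketch of such a translator for exactly those three statement forms. It emits NASM-flavored x86-64 that leans on the C library's printf and exit rather than raw system calls (assemble with nasm -felf64, link with gcc); the statement handling is deliberately naive, e.g. assigning the same variable twice would emit a duplicate label:

#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main() {
    std::string program = "print \"hello\"\na = 5\nprint a\n";
    std::map<std::string, int> strings;  // string literal -> label number
    std::ostringstream data, text;

    text << "extern printf, exit\nglobal main\nsection .text\nmain:\n"
         << "  push rbp\n";              // keep the stack 16-byte aligned for calls
    std::istringstream in(program);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string head;
        words >> head;
        if (head == "print") {
            std::string arg;
            words >> arg;
            if (arg.front() == '"') {    // print "hello"
                auto ins = strings.emplace(arg, strings.size());
                if (ins.second)
                    data << "str" << ins.first->second << ": db " << arg << ", 10, 0\n";
                text << "  lea rdi, [rel str" << ins.first->second
                     << "]\n  xor eax, eax\n  call printf\n";
            } else {                     // print a
                text << "  lea rdi, [rel fmt]\n  mov rsi, [rel var_" << arg
                     << "]\n  xor eax, eax\n  call printf\n";
            }
        } else if (!head.empty()) {      // a = 5
            std::string eq, value;
            words >> eq >> value;
            data << "var_" << head << ": dq 0\n";
            text << "  mov qword [rel var_" << head << "], " << value << "\n";
        }
    }
    text << "  xor edi, edi\n  call exit\n";
    std::cout << "section .data\nfmt: db \"%ld\", 10, 0\n" << data.str() << text.str();
}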
Good luck!
Firstly, you need to decide whether you are going to make a compiler or an interpreter. A compiler translates your code into something that can run directly on hardware, run in an interpreter, or be compiled into another language which is then interpreted in some way. Both types of languages are Turing complete, so they have the same expressive capabilities. I would suggest that you create a compiler which compiles your code into either .NET or Java bytecode, as that gives you a very optimized interpreter to run on as well as a lot of standard libraries.
Once you made your decision there are some common steps to follow
Language definition Firstly, you have to define how your language should look syntactically.
Lexer The second step is to break the source text into the smallest meaningful elements of your code, known as tokens. Here, we are talking about very basic elements such as numbers, addition signs, and strings.
Parsing The next step is to create a grammar that matches your list of tokens. You can define your grammar using e.g. a context-free grammar. A number of tools can be fed with one of these grammars and create the parser for you. Usually, the parsed tokens are organized into a parse tree. A parse tree is the representation of your grammar as a data structure which you can move around in.
Compiling or Interpreting The last step is to run some logic on your parse tree. A simple way to make your own interpreter is to create some logic associated to each node type in your tree and walk through the tree either bottom-up or top-down. If you want to compile to another language you can insert the logic of how to translate the code in the nodes instead.
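As a tiny illustration of attaching logic to each node type, here is a bottom-up tree-walking evaluator for arithmetic expressions in C++ (the node classes are assumptions for illustration; a parser would normally build the tree):

#include <iostream>
#include <memory>

// Each node type knows how to evaluate itself; evaluation is a
// bottom-up walk of the tree.
struct Expr {
    virtual ~Expr() = default;
    virtual int eval() const = 0;
};
struct Num : Expr {
    int value;
    explicit Num(int v) : value(v) {}
    int eval() const override { return value; }
};
struct BinOp : Expr {
    char op;
    std::unique_ptr<Expr> lhs, rhs;
    BinOp(char o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    int eval() const override {
        int a = lhs->eval(), b = rhs->eval();
        return op == '+' ? a + b : op == '-' ? a - b : op == '*' ? a * b : a / b;
    }
};

int main() {
    // (2 + 3) * 4 -- built by hand; a parser would construct this.
    auto tree = std::make_unique<BinOp>('*',
        std::make_unique<BinOp>('+', std::make_unique<Num>(2), std::make_unique<Num>(3)),
        std::make_unique<Num>(4));
    std::cout << tree->eval() << '\n';   // prints 20
}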
Wikipedia is great for learning more, you might want to start here.
Concerning real-world reading material, I would suggest "Programming Language Processors in Java" by David A. Watt & Deryck F. Brown. I used that book in my compilers course, and learning by example is great in this field.
These are the absolutely essential parts:
Scanner: This breaks the input file into tokens
Parser: This constructs an abstract syntax tree (AST) from the tokens identified by the scanner.
Code generation: This produces the output from the AST.
You'll also probably want:
Error handling: This tells the parser what to do if it encounters an unexpected token
Optimization: This will enable the compiler to produce more efficient machine code
Edit: Have you already designed the language? If not, you'll want to look into language design, too.
I don't know what you hope to get out of this, but if it is learning, and looking at existing code works for you, there is always tcc.
The number one essential is a book on compiler writing. A lot of people will tell you to read the "Dragon Book" by Aho et al, but the best book I've read on compilers is "Brinch Hansen on Pascal Compilers". I suspect it's out of print (Amazon is your friend), but it takes you through all the steps of designing and writing a compiler using recursive descent, which is the easiest method for compiler newbies to understand.
Although the book uses Pascal as the implementation and target languages, the lessons and techniques presented apply equally to all other languages.
The examples are all in Perl, but Exploring Programming Language Architecture in Perl is a good book (and free).
A really good set of free references, IMHO, are:
Overall compiler tutorial: Let's Build a Compiler by Jack Crenshaw (http://compilers.iecc.com/crenshaw/) It's wordy, but I like it.
Assembler: NASM (nasm.us) good for Linux and Windows/DOS, and most importantly lots of documentation and examples/tutorials. (FASM is also good but has less documentation/tutorials out there.)
Other sources
The PC Assembly book (http://www.drpaulcarter.com/pcasm/index.php)
I'm trying to write a LISP, so I'm using the Lisp 1.5 Manual. You may want to get the language spec for whatever language you're writing.
As far as 1-2 KLOC goes, assuming you use a high-level language (like Python or Ruby) you should be close if you're not too ambitious.
I always recommend flex and bison for this kind of work to a beginner. You can always learn the ins and outs of writing your own scanner and parser later; although they may increase the code size, at least they will be generated for you by tools. :)