Faithfully handle white-spacing in a pretty-printer - ocaml

I am writing a front-end for a language (by ocamllex and ocamlyacc).
So the front-end can build an Abstract Syntax Tree (AST) from a program. We then often write a pretty-printer, which takes an AST and prints a program. If we later just want to compile or analyse the AST, most of the time we don't need the printed program to be exactly the same as the original program in terms of white-spacing. However, this time I want to write a pretty printer that prints exactly the same program as the original one, in terms of white-spacing.
Therefore, my question is: what are best practices for handling white-spacing while trying not to modify the AST types too much? I really don't want to add a number (of white-spaces) to each type in the AST.
For example, this is how I currently deal with (i.e., skip) white-spacing in lexer.mll:
rule token = parse
...
| [' ' '\t'] { token lexbuf } (* skip blanks *)
| eof { EOF }
Does anyone know how to change this, as well as other parts of the front-end, to correctly take white-spacing into account for later printing?

It's quite common to keep source-file location information for each token. This information allows for more accurate errors, for example.
The most general way to do this is to keep the beginning and ending line number and column position for each token, which is a total of four numbers. If it were easy to compute the end position of a token from its value and the start position, that could be reduced to two numbers, but at the price of extra code complexity.
Bison has some features which simplify the bookkeeping work of remembering location objects; it's possible that ocamlyacc includes similar features, but I didn't see anything in the documentation. In any case, it is straightforward to maintain a location object associated with each input token.
With that information, it is easy to recreate the whitespace between two adjacent tokens, as long as what separated the tokens was whitespace. Comments are another issue.
It's a judgement call whether or not that is simpler than just attaching preceding whitespace (and even comments) to each token as it is lexed.
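To make the idea concrete, here is a rough sketch of the position-based approach (my own illustration, not part of the answer above, and written in C++ only for concreteness; ocamllex exposes the same start/end positions through its Lexing module). Given each token's start and end line/column, the whitespace between adjacent tokens can be re-emitted as newlines and spaces:
#include <iostream>
#include <string>
#include <vector>

struct Token {
    std::string text;
    int startLine, startCol;   // where the token begins (line 1-based, column 0-based)
    int endLine, endCol;       // position just past the end of the token
};

void printWithOriginalSpacing(const std::vector<Token>& tokens) {
    int line = 1, col = 0;                                           // current output position
    for (const Token& t : tokens) {
        while (line < t.startLine) { std::cout << '\n'; ++line; col = 0; }
        while (col < t.startCol)   { std::cout << ' ';  ++col; }
        std::cout << t.text;
        line = t.endLine;
        col  = t.endCol;
    }
    std::cout << '\n';
}

int main() {
    // Positions as a lexer might record them for "let  x =" on line 1 and "  1" on line 2.
    std::vector<Token> toks = {
        {"let", 1, 0, 1, 3}, {"x", 1, 5, 1, 6}, {"=", 1, 7, 1, 8}, {"1", 2, 2, 2, 3}};
    printWithOriginalSpacing(toks);              // reproduces the original layout
}
This only reconstructs gaps that consisted purely of whitespace; as noted above, comments would need to be kept as tokens (or attached to tokens) to survive.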

You can have match statements that print different numbers of spaces depending on the token you're dealing with. I would usually print one space before the token if it is an id, a num, a define statement, or an assign (=).
If the token is an arithmetic operator I would print one space before and one space after it.
If you are dealing with an if or while statement I would indent the body by four spaces.
I think the best bet would be to write a pretty_print function such as:
let rec pretty_print pos ast =
match ast with
| Some_token ->
    print_string (String.make pos ' '); (* emit 'pos' spaces of indentation; pos starts off as zero *)
    print_string "Some_token"
| Other_token -> ...
In sum I would handle white spaces by matching each token individually in a recursive function, and printing out the appropriate number of spaces.

Related

Tokenize parse option

Consider a slightly different toy example from my previous question:
. local string my first name is Pearly,, and my surname is Spencer
. tokenize "`string'", parse(",,")
. display "`1'"
my first name is Pearly
. display "`2'"
,
. display "`3'"
,
. display "`4'"
and my surname is Spencer
I have two questions:
Does tokenize work as expected in this case? I thought local macro 2 should contain ,, (not just ,), while local macro 3 should contain the rest of the string (and local macro 4 should be empty).
Is there a way to force tokenize to respect the double comma as a parsing
character?
tokenize -- and gettoken too -- won't, from what I can see, accept repeated characters such as ,, as a composite parsing character. ,, is not illegal as a specification of parsing characters, but is just understood as meaning that , and , are acceptable parsing characters. The repetition in practice is ignored, just as adding "My name is Pearly" after "My name is Pearly" doesn't add information in a conversation.
To back up: know that without other instructions (such as might be given by a syntax command) Stata will parse a string according to spaces, except that double quotes (or compound double quotes) bind harder than spaces separate.
tokenize -- and gettoken too -- will accept multiple parse characters pchars and the help for tokenize gives an example with space and + sign. (It's much more common, in my experience, to want to use space and comma , when the syntax for a command is not quite what syntax parses completely.)
A difference between space and the other parsing characters is that spaces are discarded but other parsing characters are not discarded. The rationale here is that those characters often have meaning you might want to take forward. Thus in setting up syntax for a command option, you might want to allow something like myoption( varname [, suboptions])
and so whether a comma is present and other stuff follows is important for later code.
With composite characters, so that you are looking for, say, ,, as a separator, I think you'd need to loop around using substr() or an equivalent. In practice an easier work-around might be first to replace your composite characters with some neutral single character and then apply tokenize. That relies on knowing that the neutral character should not otherwise occur. Thus I often use # as a placeholder character, because I know it will not occur as part of variable or scalar names and it's not part of function names or an operator.
For what it's worth, I note that in first writing split I allowed composite characters as separators. As I recall, a trigger to that was a question on Statalist which was about data for legal cases with multiple variations on VS (versus) to indicate which party was which. This example survives into the help for the official command.
On what is a "serious" bug, much depends on judgment. I think a programmer would just discover on trying it out that composite characters don't work as desired with tokenize in cases like yours.

What does Tokens do and why they need to be created in C++ programming?

I am reading a book (Programming Principles and Practice by Bjarne Stroustrup) in which he introduces tokens:
“A token is a sequence of characters that represents something we consider a unit, such as a number or an operator. That’s the way a C++ compiler deals with its source. Actually, “tokenizing” in some form or another is the way most analysis of text starts.”
class Token {
public:
char kind;
double value;
};
I do get what they are, but he never explains this in detail and it's quite confusing to me.
Tokenizing is important to the process of figuring out what a program does. What Bjarne is referring to in relation to C++ source deals with how a program's meaning is affected by the tokenization rules. In particular, we must know what the tokens are, and how they are determined. Specifically, how can we identify a single token when it appears next to other characters, and how should we delimit tokens if there is ambiguity.
For instance, consider the prefix operators ++ and +. Let's assume we only had one token + to work with. What is the meaning of the following snippet?
int i = 1;
++i;
With + only, is the above going to just apply unary + on i twice? Or is it going to increment it once? It's ambiguous, naturally. We need an additional token, and therefore introduce ++ as its own "word" in the language.
But now there is another (albeit smaller) problem. What if the programmer wants to just apply unary + twice, and not increment? Token processing rules are needed. So if we determine that a white space is always a separator for tokens, our programmer may write:
int i = 1;
+ +i;
Roughly speaking, a C++ implementation starts with a file full of characters, transforms them initially to a sequence of tokens ("words" with meaning in the C++ language), and then checks if the tokens appear in a "sentence" that has some valid meaning.
He's referring to lexical analysis, a necessary piece of every compiler. It is a tool for the compiler to treat a text (as in: a sequence of bytes) in a meaningful way. For example, consider the following line in C++
double x = (15*3.0); // my variable
when the compiler looks at the text it first splits the line into a sequence of tokens which may look like this:
Token {"identifier", "double"}
Token {"space", " "}
Token {"identifier", "x"}
Token {"space", " "}
Token {"operator", "="}
Token {"space", " "}
Token {"separator", "("}
Token {"literal_integer", "15"}
Token {"operator", "*"}
Token {"literal_float", "3.0"}
Token {"separator", ")"}
Token {"separator", ";"}
Token {"space", " "}
Token {"comment", "// my variable"}
Token {"end_of_line"}
It doesn't have to be interpreted like the above (note that in my case both kind and value are strings); it's just an example of how it can be done. You usually do this via some regular expressions.
Anyway, tokens are easier for the machine to understand than raw text. The next step for the compiler is to create a so-called abstract syntax tree based on the tokenization, and finally add meaning to everything.
Also note that unless you are writing a parser it is unlikely you will ever use the concept.
As mentioned by others, Bjarne is referring to lexical analysis.
In general terms, tokenizing (creating tokens) is the process of reading an input stream and dividing it into blocks, without worrying about whitespace etc., as described earlier by StoryTeller.
Or, as Bjarne said, a token "is a sequence of characters that represents something we consider a unit".
The Token class itself is an example of a C++ user-defined type (UDT), like int or char, so Token can be used to define variables and hold values.
A UDT can have member functions as well as data members. In your code the class defines two data members, which is very basic:
1) kind, 2) value
class Token {
public:
char kind;
double value;
};
Based on it we can initialize or construct its objects.
Token token_kind_one{'+'};
This initializes token_kind_one with its kind (an operator), '+';
Token token_kind_two{'8',3.14};
and this initializes token_kind_two with its kind (a number), '8', and a value of 3.14.
Let's assume we have an expression of ten characters, 1+2*3(5/4), which translates to ten tokens.
Tokens:
Kind  | '8' | '+' | '8' | '*' | '8' | '(' | '8' | '/' | '8' | ')' |
Value |  1  |     |  2  |     |  3  |     |  5  |     |  4  |     |
The C++ compiler transforms the file's data into a token sequence, skipping all whitespace, in order to make it understandable to itself.
Broadly speaking, a compiler will run multiple operations on a given source code before converting it into a binary format. One of the first stages is running a tokenizer, where the contents of a source file are converted to tokens, which are units understood by the compiler. For example, if you write a statement int a, the tokenizer might create a structure to store this information.
Type: integer
Identifier: A
Reserved Word: No
Line number: 10
This would be then referred to as a token, and most of the code in a source file will be broken down into similar structures.
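As a rough illustration (the field names are taken from the listing above; this is not any particular compiler's actual data structure), such a token record might be declared like this:
#include <string>

struct Token {
    std::string type;        // e.g. "integer"
    std::string identifier;  // e.g. "a"
    bool reservedWord;       // e.g. false
    int lineNumber;          // e.g. 10
};

// For the statement "int a" on line 10, the tokenizer might build:
// Token t{"integer", "a", false, 10};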

Is it bad idea using regex to tokenize string for lexer?

I'm not sure how I'm going to tokenize the source for the lexer. For now, I can only think of using regex to parse the string into an array with given rules (identifiers, symbols such as +, -, etc.).
For instance,
begin x:=1;y:=2;
then I want to tokenize each word and variable (x and y in this case) and each symbol (:, =, ;).
Using regexes is a common way of implementing a lexer. If you don't want to use them then you'll sort of end up implementing some regex parts yourself anyway.
Although performance-wise it can be more efficient if you do it yourself, it isn't a must.
Using regular expressions is THE traditional way to generate your tokens.
lex and yacc (or flex and bison) are a traditional compiler creation pair, where lex does nothing except tokenize symbols and pass them to YACC
http://en.wikipedia.org/wiki/Lex_%28software%29
YACC is a stack based state machine (pushdown automaton) that processes the symbols.
I think regex processing is the way to go for parsing symbols of any level of complexity. As Oak mentions, you'll end up writing your own (probably inferior) regex parser. The only exception would be if it is dead simple, and even your posted example starts to exceed "dead simple".
in lex syntax:
:= return ASSIGN_TOKEN_OR_WHATEVER;
begin return BEGIN_TOKEN;
[0-9]+ return NUMBER;
[a-zA-Z][a-zA-Z0-9]* return WORD;
Character sequences are optionally passed along with the token.
Individual characters that are tokens in their own right (e.g. ";") get passed along unmodified. It's not the only way, but I have found it to work very well.
Have a look:
http://www.faqs.org/docs/Linux-HOWTO/Lex-YACC-HOWTO.html
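For what it's worth, here is a minimal sketch of the regex approach applied to the example in the question. The token classes and the std::regex pattern are illustrative assumptions, not anything prescribed by the answers above:
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string src = "begin x:=1;y:=2;";
    // One alternative per token class: identifiers/keywords, numbers, ":=", single symbols.
    std::regex token_re(R"([A-Za-z_][A-Za-z0-9_]*|[0-9]+|:=|[;:=+\-*/()])");
    for (auto it = std::sregex_iterator(src.begin(), src.end(), token_re);
         it != std::sregex_iterator(); ++it)
        std::cout << "token: " << it->str() << '\n';   // whitespace is skipped implicitly
}
This prints begin, x, :=, 1, ;, y, :=, 2, ; as separate tokens; a real lexer would additionally attach a token kind to each match.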

Counting lines of code

I was doing some research on line counters for C++ projects and I'm very interested in algorithms they use. Does anyone know where can I look at some implementation of such algorithms?
There's cloc, which is a free open-source source lines of code counter. It has support for many languages, including C++. I personally use it to get the line count of my projects.
At its sourceforge page you can find the perl source code for download.
Well, if by line counters, you mean programs which count lines, then the algorithm is pretty trivial: just count the number of '\n' in the code. If, on the other hand, you mean programs which count C++ statements, or produce other metrics... Although not 100% accurate, I've gotten pretty good results in the past just by counting '}' and ';' (ignoring those in comments and string and character literals, of course). Anything more accurate would probably require parsing the actual C++.
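The "trivial" physical-line count mentioned above might look roughly like this (my own sketch; the file name is only an example):
#include <fstream>
#include <iostream>

int main() {
    std::ifstream in("example.cpp");   // example file name
    long lines = 0;
    char c;
    while (in.get(c))
        if (c == '\n') ++lines;        // one line per newline character
    std::cout << lines << " lines\n";
}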
You don't need to actually parse the code to count lines; it's enough to tokenise it.
The algorithm could look like:
int lastLine = -1;
int lines = 0;
for (const Token& token : tokens) {          // tokens: the tokenizer's output; Token is assumed to carry a line number
    if (isCode(token) && lastLine != token.line) {
        ++lines;                             // first code token seen on this line
        lastLine = token.line;
    }
}
The only information you need to collect during tokenisation is:
what type of token it is (an operator, an identifier, a comment...). You don't need to get very precise here, actually, as you only need to distinguish "non-code tokens" (comments) from "code tokens" (anything else);
at which line in the file the token occurs.
On how to tokenise, that's for you to figure out, but hand-writing a tokeniser for such a simple case shouldn't be hard. You could use flex but that's probably redundant.
EDIT
I've mentioned "tokenisation", let me describe it for you quickly:
Tokenisation is the first stage of compilation. The input of tokenisation is text (multi-line program), and the output is a sequence of "tokens", as in: symbols with some meaning. For instance, the following program:
#include "something.h"
/*
This is my program.
It is quite useless.
*/
int main() {
return something(2+3); // this is equal to 5
}
could produce a token stream like this:
PreprocessorDirective("include")
StringLiteral("something.h")
PreprocessorDirectiveEnd
MultiLineComment(...)
Keyword(INT)
Identifier("main")
Symbol(LeftParen)
Symbol(RightParen)
Symbol(LeftBrace)
Keyword(RETURN)
Identifier("something")
Symbol(LeftParen)
NumericLiteral(2)
Operator(PLUS)
NumericLiteral(3)
Symbol(RightParen)
Symbol(Semicolon)
SingleLineComment(" this is equal to 5")
Symbol(RightBrace)
Et cetera.
Tokens, depending on their type, may have arbitrary meta-data attached to them (i.e. the symbol type, the operator type, the identifier text, or perhaps the number of the line where the token was found).
Such stream of tokens is then fed to the parser, which uses grammar production rules written in terms of these tokens, for instance, to build a syntax tree.
Doing a full parser that would give you a complete syntax tree of the code is challenging, and especially challenging if it's C++ we're talking about. However, tokenising (or "lexing" or "lexical analysis") is easier, especially when you're not concerned about too many details, and you should be able to write a tokeniser yourself using a finite state machine.
On how to actually use the output to count lines of code (i.e. lines in which at least one "code" token, that is, any token except a comment, starts), see the algorithm I described earlier.
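A condensed sketch of that idea (mine, not the answerer's code): a small state machine that skips // and /* */ comments and counts a line as code when any other non-whitespace character appears on it. String literals containing comment markers are deliberately not handled, to keep the sketch short.
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream in("example.cpp");   // example file name
    enum State { Code, BlockComment } state = Code;
    long codeLines = 0;
    std::string line;
    while (std::getline(in, line)) {
        bool lineHasCode = false;
        for (std::size_t i = 0; i < line.size(); ++i) {
            char c = line[i];
            char next = (i + 1 < line.size()) ? line[i + 1] : '\0';
            if (state == Code) {
                if (c == '/' && next == '/') break;                        // rest of line is a comment
                if (c == '/' && next == '*') { state = BlockComment; ++i; continue; }
                if (!std::isspace(static_cast<unsigned char>(c))) lineHasCode = true;
            } else {                                                       // inside /* ... */
                if (c == '*' && next == '/') { state = Code; ++i; }
            }
        }
        if (lineHasCode) ++codeLines;
    }
    std::cout << codeLines << " lines of code\n";
}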
I think part of the reason people are having so much trouble understanding your problem is that "count the lines of C++" is itself an algorithm. Perhaps what you're trying to ask is "How do I identify a line of C++ in a file?" That is an entirely different question, which Kos seems to have done a pretty good job of trying to explain.

what exactly is a token, in relation to parsing

I have to use a parser and writer in C++. I am trying to implement the functions; however, I do not understand what a token is. One of my functions/operations is to check whether there are more tokens to produce:
bool Parser::hasMoreTokens()
How exactly do I go about this? Please help.
So: I am opening a text file with text in it, all words lowercased. How do I go about checking to see if it has more tokens?
This is what I have:
bool Parser::hasMoreTokens() {
    while (source.peek() != NULL) {
        return true;
    }
    return false;
}
Tokens are the output of lexical analysis and the input to parsing. Typically they are things like
numbers
variable names
parentheses
arithmetic operators
statement terminators
That is, roughly, the biggest things that can be unambiguously identified by code that just looks at its input one character at a time.
One note, which you should feel free to ignore if it confuses you: The boundary between lexical analysis and parsing is a little fuzzy. For instance:
Some programming languages have complex-number literals that look, say, like 2+3i or 3.2e8-17e6i. If you were parsing such a language, you could make the lexer gobble up a whole complex number and make it into a token; or you could have a simpler lexer and a more complicated parser, and make (say) 3.2e8, -, 17e6i be separate tokens; it would then be the parser's job (or even the code generator's) to notice that what it's got is really a single literal.
In some programming languages, the lexer may not be able to tell whether a given token is a variable name or a type name. (This happens in C, for instance.) But the grammar of the language may distinguish between the two, so that you'd like "variable foo" and "type name foo" to be different tokens. (This also happens in C.) In this case, it may be necessary for some information to be fed back from the parser to the lexer so that it can produce the right sort of token in each case.
So "what exactly is a token?" may not always have a perfectly well defined answer.
A token is whatever you want it to be. Traditionally (and for good reasons), language specifications broke the analysis into two parts: the first part broke the input stream into tokens, and the second parsed the tokens. (Theoretically, I think you can write any grammar in only a single level, without using tokens, or what is the same thing, using individual characters as tokens. I wouldn't like to see the results of that for a language like C++, however.) But the definition of what a token is depends entirely on the language you are parsing: most languages, for example, treat white space as a separator (but not Fortran); most languages will predefine a set of punctuation/operators using punctuation characters, and not allow these characters in symbols (but not COBOL, where "abc-def" would be a single symbol). In some cases (including in the C++ preprocessor), what is a token depends on context, so you may need some feedback from the parser. (Hopefully not; that sort of thing is for very experienced programmers.)
One thing is probably sure (unless each character is a token): you'll have to read ahead in the stream. You typically can't tell whether there are more tokens by just looking at a single character. I've generally found it useful, in fact, for the tokenizer to read a whole token at a time, and keep it until the parser needs it. A function like hasMoreTokens would in fact scan a complete token.
(And while I'm at it, if source is an istream: istream::peek does not return a pointer, but an int.)
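A minimal sketch of that "read a whole token ahead" approach (the class and member names here are illustrative, not the asker's actual Parser interface): hasMoreTokens scans and buffers the next whitespace-separated token, and nextToken hands it out.
#include <iostream>
#include <sstream>
#include <string>

class Tokenizer {
public:
    explicit Tokenizer(std::istream& in) : source(in) {}

    // Scan one token ahead if we don't already have one buffered.
    bool hasMoreTokens() {
        if (!buffered)
            buffered = static_cast<bool>(source >> pending);   // false once the stream is exhausted
        return buffered;
    }

    std::string nextToken() {
        hasMoreTokens();        // make sure a token is buffered
        buffered = false;
        return pending;
    }

private:
    std::istream& source;
    std::string pending;        // the token read ahead
    bool buffered = false;
};

int main() {
    std::istringstream text("these are some lowercased words");
    Tokenizer t(text);
    while (t.hasMoreTokens())
        std::cout << t.nextToken() << '\n';
}
A real lexer would classify each token (identifier, number, symbol) rather than splitting on whitespace only, but the lookahead structure stays the same.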
A token is the smallest unit of a programming language that has a meaning. A parenthesis (, a name foo, an integer 123, are all tokens. Reducing a text to a series of tokens is generally the first step of parsing it.
A token is usually akin to a word in spoken language. In C++, (int, float, 5.523, const) will be tokens. It is the minimal unit of text which constitutes a semantic element.
When you split a large unit (long string) into a group of sub-units (smaller strings), each of the sub-units (smaller strings) is referred to as a "token". If there are no more sub-units, then you are done parsing.
How do I tokenize a string in C++?
A token is a terminal in a grammar: a sequence of one or more symbols that is defined by the sequence itself, i.e. it does not derive from any other production defined in the grammar.