I know I need to show that there is no string that I can get using only left most operation that will lead to two different parsing trees. But how can I do it? I know there is not a simple way of doing it, but since this exercise is on the compilers Dragon book, then I am pretty sure there is a way of showing (no need to be a formal proof, just justfy why) it.
The Gramar is:
S-> SS* | SS+ | a
What this grammar represents is another way of simple arithmetic(I do not remember the name if this technique of anyone knows, please tell me ) : normal sum arithmetic has the form a+a, and this just represents another way of summing and multiplying. So aa+ also mean a+a, aaa*+ is a*a+a and so on
The easiest way to prove that a CFG is unambiguous is to construct an unambiguous parser. If the grammar is LR(k) or LL(k) and you know the value of k, then that is straightforward.
This particular grammar is LR(0), so the parser construction is almost trivial; you should be able to do it on a single sheet of paper (which is worth doing before you try to look up the answer.
The intuition is simple: every production ends with a different terminal symbol, and those terminal symbols appear nowhere else in the grammar. So when you read a symbol, you know precisely which production to use to reduce; there is only one which can apply, and there is no left-hand side you can shift into.
If you invert the grammar to produce Polish (or Łukasiewicz) notation, then you get a trivial LL grammar. Again the parsing algorithm is obvious, since every right hand side starts with a unique terminal, so there is only one prediction which can be made:
S → * S S | + S S | a
So that's also unambiguous. But the infix grammar is ambiguous:
S → S * S | S + S | a
The easiest way to provide ambiguity is to find a sentence which has two parses; one such sentence in this case is:
a + a + a
I think the example string aa actually shows what you need. Can it not be parsed as:
S => SS* => aa OR S => SS+ => aa
I've made a complete parser in bison ( and of course complete lexer in flex ), and I noticed only yesterday that I've a problem in my parser. In If structure in fact.
Here are my rules in my parser: http://pastebin.com/TneESwUx
Here the single IF is not recognized, and if I replace "%prec IFX" with "END", by adding a new token "END" ("end" return END; in flex), it works. But I don't want to have a new "end" keyword, that's why I don't use this solution.
Please help me.
'The correct way to handle this kind of rule is not precedence, it is refactoring to use an optional else-part so that the parser can use token lookahead to decide how to parse. I would design it something like this:
stmt : IF '(' expression ')' stmts else_part
| /* other statement productions here */
else_part : /* optional */
| ELSE stmts
stmts : stmt
| '{' stmt_list '}'
| '{' '}'
stmt_list : stmt
| stmt_list ';' stmt
(This method of special-casing stmts instead of allowing stmt to include a block may not be optimal in terms of productions, and may introduce oddities in your language, but without more details it's hard to say for certain. bison can produce a report showing you how the parser it generated works; you may want to study it. Also beware of unexpected shift/reduce conflicts and especially of any reduce/reduce conflicts.)
Note that shift/reduce conflicts are entirely normal in this kind of grammar; the point of an LALR(1) parser is to use these conflicts as a feature, looking ahead by a single token to resolve the conflict. They are reported specifically so that you can more easily detect the ones you don't want, that you introduce by incorrectly factoring your grammar.
Your IfExpression also needs to be refactored to match; the trick is that else_part should produce a conditional expression of some kind to $$, and in the production for IF you test $6 (corresponding to else_part) and invoke the appropriate IfExpression constructor.
Your grammar is ambiguous, so you have to live with the shift/reduce conflict. The END token eliminates the ambiguity by ensuring that an IF statement is always properly closed, like a pair of parentheses.
Parentheses make a good analogy here. Suppose you had a grammar like this:
maybe_closed_parens : '(' stuff
| '(' stuff ')'
;
stuff itself generates some grammar symbols, and one of them is maybe_closed_parens.
So if you have input like ((((((whatever, it is correct. The parentheses do not have to be closed. But what if you add )? Which ( is it closing?
This is very much like not being able to tell which IF matches an ELSE.
If you add END to the syntax of the IF (whether or not there is an ELSE), then that is like having a closing parentheses. IF and END are like ( and ).
Of course, you are stylistically right not to want the word END in your language, because you already have curly braced blocks, which are basically an alternate spelling for Pascal's BEGIN and END. Your } is already an END keyword.
So what you can do is impose the rule that an IF accepts only compound statements (i.e. fully braced):
if_statement : IF condition compound_statement
| IF condition compound_statement ELSE compound_statement
Now it is impossible to have an ambiguity like if x if y w else z because braces must be present: if x { if y { w } else { z } } or if x { if y { w } } else { z }.
I seem to recall that Perl is an example of a language which has made this choice. It's not a bad idea because not only does it eliminate your ambiguity, more importantly it eliminates ambiguities from programs.
I see that you don't have a compound_statement phrase structure rule in your grammar because your statement generates a phrase enclosed in { and } directly. You will have to hack that in if you take this approach.
I've posted this on the D newsgroup some months ago, but for some reason, the answer never really convinced me, so I thought I'd ask it here.
The grammar of D is apparently context-free.
The grammar of C++, however, isn't (even without macros). (Please read this carefully!)
Now granted, I know nothing (officially) about compilers, lexers, and parsers. All I know is from what I've learned on the web.
And here is what (I believe) I have understood regarding context, in not-so-technical lingo:
The grammar of a language is context-free if and only if you can always understand the meaning (though not necessarily the exact behavior) of a given piece of its code without needing to "look" anywhere else.
Or, in even less rigor:
The grammar cannot be context-free if I need I can't tell the type of an expression just by looking at it.
So, for example, C++ fails the context-free test because the meaning of confusing<sizeof(x)>::q < 3 > (2) depends on the value of q.
So far, so good.
Now my question is: Can the same thing be said of D?
In D, hashtables can be created through a Value[Key] declaration, for example
int[string] peoplesAges; // Maps names to ages
Static arrays can be defined in a similar syntax:
int[3] ages; // Array of 3 elements
And templates can be used to make them confusing:
template Test1(T...)
{
alias int[T[0]] Test;
}
template Test2(U...)
{
alias int[U] Test2; // LGTM
}
Test1!(5) foo;
Test1!(int) bar;
Test2!(int) baz; // Guess what? It's invalid code.
This means that I cannot tell the meaning of T[0] or U just by looking at it (i.e. it could be a number, it could be a data type, or it could be a tuple of God-knows-what). I can't even tell if the expression is grammatically valid (since int[U] certainly isn't -- you can't have a hashtable with tuples as keys or values).
Any parsing tree that I attempt to make for Test would fail to make any sense (since it would need to know whether the node contains a data type versus a literal or an identifier) unless it delays the result until the value of T is known (making it context-dependent).
Given this, is D actually context-free, or am I misunderstanding the concept?
Why/why not?
Update:
I just thought I'd comment: It's really interesting to see the answers, since:
Some answers claim that C++ and D can't be context-free
Some answers claim that C++ and D are both context-free
Some answers support the claim that C++ is context-sensitive while D isn't
No one has yet claimed that C++ is context-free while D is context-sensitive :-)
I can't tell if I'm learning or getting more confused, but either way, I'm kind of glad I asked this... thanks for taking the time to answer, everyone!
Being context free is first a property of generative grammars. It means that what a non-terminal can generate will not depend on the context in which the non-terminal appears (in non context-free generative grammar, the very notion of "string generated by a given non-terminal" is in general difficult to define). This doesn't prevent the same string of symbols to be generated by two non-terminals (so for the same strings of symbols to appear in two different contexts with a different meaning) and has nothing to do with type checking.
It is common to extend the context-free definition from grammars to language by stating that a language is context-free if there is at least one context free grammar describing it.
In practice, no programming language is context-free because things like "a variable must be declared before it is used" can't be checked by a context-free grammar (they can be checked by some other kinds of grammars). This isn't bad, in practice the rules to be checked are divided in two: those you want to check with the grammar and those you check in a semantic pass (and this division also allows for better error reporting and recovery, so you sometimes want to accept more in the grammar than what would be possible in order to give your users better diagnostics).
What people mean by stating that C++ isn't context-free is that doing this division isn't possible in a convenient way (with convenient including as criteria "follows nearly the official language description" and "my parser generator tool support that kind of division"; allowing the grammar to be ambiguous and the ambiguity to be resolved by the semantic check is an relatively easy way to do the cut for C++ and follow quite will the C++ standard, but it is inconvenient when you are relying on tools which don't allow ambiguous grammars, when you have such tools, it is convenient).
I don't know enough about D to know if there is or not a convenient cut of the language rules in a context-free grammar with semantic checks, but what you show is far from proving the case there isn't.
The property of being context free is a very formal concept; you can find a definition here. Note that it applies to grammars: a language is said to be context free if there is at least one context free grammar that recognizes it. Note that there may be other grammars, possibly non context free, that recognize the same language.
Basically what it means is that the definition of a language element cannot change according to which elements surround it. By language elements I mean concepts like expressions and identifiers and not specific instances of these concepts inside programs, like a + b or count.
Let's try and build a concrete example. Consider this simple COBOL statement:
01 my-field PICTURE 9.9 VALUE 9.9.
Here I'm defining a field, i.e. a variable, which is dimensioned to hold one integral digit, the decimal point, and one decimal digit, with initial value 9.9 . A very incomplete grammar for this could be:
field-declaration ::= level-number identifier 'PICTURE' expression 'VALUE' expression '.'
expression ::= digit+ ( '.' digit+ )
Unfortunately the valid expressions that can follow PICTURE are not the same valid expressions that can follow VALUE. I could rewrite the second production in my grammar as follows:
'PICTURE' expression ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
'VALUE' expression ::= digit+ ( '.' digit+ )
This would make my grammar context-sensitive, because expression would be a different thing according to whether it was found after 'PICTURE' or after 'VALUE'. However, as it has been pointed out, this doesn't say anything about the underlying language. A better alternative would be:
field-declaration ::= level-number identifier 'PICTURE' format 'VALUE' expression '.'
format ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
expression ::= digit+ ( '.' digit+ )
which is context-free.
As you can see this is very different from your understanding. Consider:
a = b + c;
There is very little you can say about this statement without looking up the declarations of a,b and c, in any of the languages for which this is a valid statement, however this by itself doesn't imply that any of those languages is not context free. Probably what is confusing you is the fact that context freedom is different from ambiguity. This a simplified version of your C++ example:
a < b > (c)
This is ambiguous in that by looking at it alone you cannot tell whether this is a function template call or a boolean expression. The previous example on the other hand is not ambiguous; From the point of view of grammars it can only be interpreted as:
identifier assignment identifier binary-operator identifier semi-colon
In some cases you can resolve ambiguities by introducing context sensitivity at the grammar level. I don't think this is the case with the ambiguous example above: in this case you cannot eliminate the ambiguity without knowing whether a is a template or not. Note that when such information is not available, for instance when it depends on a specific template specialization, the language provides ways to resolve ambiguities: that is why you sometimes have to use typename to refer to certain types within templates or to use template when you call member function templates.
There are already a lot of good answers, but since you are uninformed about grammars, parsers and compilers etc, let me demonstrate this by an example.
First, the concept of grammars are quite intuitive. Imagine a set of rules:
S -> a T
T -> b G t
T -> Y d
b G -> a Y b
Y -> c
Y -> lambda (nothing)
And imagine you start with S. The capital letters are non-terminals and the small letters are terminals. This means that if you get a sentence of all terminals, you can say the grammar generated that sentence as a "word" in the language. Imagine such substitutions with the above grammar (The phrase between *phrase* is the one being replaced):
*S* -> a *T* -> a *b G* t -> a a *Y* b t -> a a b t
So, I could create aabt with this grammar.
Ok, back to main line.
Let us assume a simple language. You have numbers, two types (int and string) and variables. You can do multiplication on integers and addition on strings but not the other way around.
First thing you need, is a lexer. That is usually a regular grammar (or equal to it, a DFA, or equally a regular expression) that matches the program tokens. It is common to express them in regular expressions. In our example:
(I'm making these syntaxes up)
number: [1-9][0-9]* // One digit from 1 to 9, followed by any number
// of digits from 0-9
variable: [a-zA-Z_][a-zA-Z_0-9]* // You get the idea. First a-z or A-Z or _
// then as many a-z or A-Z or _ or 0-9
// this is similar to C
int: 'i' 'n' 't'
string: 's' 't' 'r' 'i' 'n' 'g'
equal: '='
plus: '+'
multiply: '*'
whitespace: (' ' or '\n' or '\t' or '\r')* // to ignore this type of token
So, now you got a regular grammar, tokenizing your input, but it understands nothing of the structure.
Then you need a parser. The parser, is usually a context free grammar. A context free grammar means, in the grammar you only have single nonterminals on the left side of grammar rules. In the example in the beginning of this answer, the rule
b G -> a Y b
makes the grammar context-sensitive because on the left you have b G and not just G. What does this mean?
Well, when you write a grammar, each of the nonterminals have a meaning. Let's write a context-free grammar for our example (| means or. As if writing many rules in the same line):
program -> statement program | lambda
statement -> declaration | executable
declaration -> int variable | string variable
executable -> variable equal expression
expression -> integer_type | string_type
integer_type -> variable multiply variable |
variable multiply number |
number multiply variable |
number multiply number
string_type -> variable plus variable
Now this grammar can accept this code:
x = 1*y
int x
string y
z = x+y
Grammatically, this code is correct. So, let's get back to what context-free means. As you can see in the example above, when you expand executable, you generate one statement of the form variable = operand operator operand without any consideration which part of code you are at. Whether the very beginning or middle, whether the variables are defined or not, or whether the types match, you don't know and you don't care.
Next, you need semantics. This is were context-sensitive grammars come into play. First, let me tell you that in reality, no one actually writes a context sensitive grammar (because parsing it is too difficult), but rather bit pieces of code that the parser calls when parsing the input (called action routines. Although this is not the only way). Formally, however, you can define all you need. For example, to make sure you define a variable before using it, instead of this
executable -> variable equal expression
you have to have something like:
declaration some_code executable -> declaration some_code variable equal expression
more complex though, to make sure the variable in declaration matches the one being calculated.
Anyway, I just wanted to give you the idea. So, all these things are context-sensitive:
Type checking
Number of arguments to function
default value to function
if member exists in obj in code: obj.member
Almost anything that's not like: missing ; or }
I hope you got an idea what are the differences (If you didn't, I'd be more than happy to explain).
So in summary:
Lexer uses a regular grammar to tokenize input
Parser uses a context-free grammar to make sure the program is in correct structure
Semantic analyzer uses a context-sensitive grammar to do type-checking, parameter matching etc etc
It is not necessarily always like that though. This just shows you how each level needs to get more powerful to be able to do more stuff. However, each of the mentioned compiler levels could in fact be more powerful.
For example, one language that I don't remember, used array subscription and function call both with parentheses and therefore it required the parser to go look up the type (context-sensitive related stuff) of the variable and determine which rule (function_call or array_substitution) to take.
If you design a language with lexer that has regular expressions that overlap, then you would need to also look up the context to determine which type of token you are matching.
To get to your question! With the example you mentioned, it is clear that the c++ grammar is not context-free. The language D, I have absolutely no idea, but you should be able to reason about it now. Think of it this way: In a context free grammar, a nonterminal can expand without taking into consideration anything, BUT the structure of the language. Similar to what you said, it expands, without "looking" anywhere else.
A familiar example would be natural languages. For example in English, you say:
sentence -> subject verb object clause
clause -> .... | lambda
Well, sentence and clause are nonterminals here. With this grammar you can create these sentences:
I go there because I want to
or
I jump you that I is air
As you can see, the second one has the correct structure, but is meaningless. As long as a context free grammar is concerned, the meaning doesn't matter. It just expands verb to whatever verb without "looking" at the rest of the sentence.
So if you think D has to at some point check how something was defined elsewhere, just to say the program is structurally correct, then its grammar is not context-free. If you isolate any part of the code and it still can say that it is structurally correct, then it is context-free.
There is a construct in D's lexer:
string ::= q" Delim1 Chars newline Delim2 "
where Delim1 and Delim2 are matching identifiers, and Chars does not contain newline Delim2.
This construct is context sensitive, therefore D's lexer grammar is context sensitive.
It's been a few years since I've worked with D's grammar much, so I can't remember all the trouble spots off the top of my head, or even if any of them make D's parser grammar context sensitive, but I believe they do not. From recall, I would say D's grammar is context free, not LL(k) for any k, and it has an obnoxious amount of ambiguity.
The grammar cannot be context-free if I need I can't tell the type of
an expression just by looking at it.
No, that's flat out wrong. The grammar cannot be context-free if you can't tell if it is an expression just by looking at it and the parser's current state (am I in a function, in a namespace, etc).
The type of an expression, however, is a semantic meaning, not syntactic, and the parser and the grammar do not give a penny about types or semantic validity or whether or not you can have tuples as values or keys in hashmaps, or if you defined that identifier before using it.
The grammar doesn't care what it means, or if that makes sense. It only cares about what it is.
To answer the question of if a programming language is context free you must first decide where to draw the line between syntax and semantics. As an extreme example, it is illegal in C for a program to use the value of some kinds of integers after they have been allowed to overflow. Clearly this can't be checked at compile time, let alone parse time:
void Fn() {
int i = INT_MAX;
FnThatMightNotReturn(); // halting problem?
i++;
if(Test(i)) printf("Weeee!\n");
}
As a less extreme example that others have pointed out, deceleration before use rules can't be enforced in a context free syntax so if you wish to keep your syntax pass context free, then that must be deferred to the next pass.
As a practical definition, I would start with the question of: Can you correctly and unambiguously determine the parse tree of all correct programs using a context free grammar and, for all incorrect programs (that the language requires be rejected), either reject them as syntactically invalid or produce a parse tree that the later passes can identify as invalid and reject?
Given that the most correct spec for the D syntax is a parser (IIRC an LL parser) I strongly suspect that it is in fact context free by the definition I suggested.
Note: the above says nothing about what grammar the language documentation or a given parser uses, only if a context free grammar exists. Also, the only full documentation on the D language is the source code of the compiler DMD.
These answers are making my head hurt.
First of all, the complications with low level languages and figuring out whether they are context-free or not, is that the language you write in is often processed in many steps.
In C++ (order may be off, but that shouldn't invalidate my point):
it has to process macros and other preprocessor stuffs
it has to interpret templates
it finally interprets your code.
Because the first step can change the context of the second step and the second step can change the context of the third step, the language YOU write in (including all of these steps) is context sensitive.
The reason people will try and defend a language (stating it is context-free) is, because the only exceptions that adds context are the traceable preprocessor statements and template calls. You only have to follow two restricted exceptions to the rules to pretend the language is context-free.
Most languages are context-sensitive overall, but most languages only have these minor exceptions to being context-free.
I'm not looking for an implementation, just pseudo-code, or at least an algorithm to handle this effectively. I need to process statements like these:
(a) # if(a)
(a,b) # if(a || b)
(a+b) # if(a && b)
(a+b,c) # same as ((a+b),c) or if((a&&b) || c)
(a,b+c) # same as (a,(b|c)) or if(a || (b&&c))
So the + operator takes precedence over the , operator. (so my + is like mathematical multiplication with , being mathematical addition, but that is just confusing).
I think a recursive function would be best, so I can handle nested parentheses nice and easy by a recursive call. I'll also take care of error handling once the function returns, so no worries there. The problems I'm having:
I just don't know how to tackle the precedence thing. I could return true as soon as I see a , and the previous value was true. Otherwise, I'll rerun the same routine. A plus would effectively be a bool multiplication (ie true*true=true, true*false=false etc...).
Error detection: I've thought up several schemes to handle the input, but there are a lot of ugly bad things I want to detect and print an error to the user. None of the schemes I thought of handle errors in a unified (read: centralized) place in the code, which would be nice for maintainability and readability:
()
(,...
(+...
(a,,...
(a,+...
(a+,...
(a++...
Detecting these in my "routine" above should take care of bad input. Of course I'll check end-of-input each time I read a token.
Of course I'll have the problem of maybe having o read the full text file if there are unmatched parenthesis, but hey, people should avoid such tension.
EDIT: Ah, yes, I forgot the ! which should also be usable like the classic not operator:
(!a+b,c,!d)
Tiny update for those interested: I had an uninformed wild go at this, and wrote my own implementation from scratch. It may not be pretty enough for the die-hards, so hence this question on codereview.
The shunting-yard algorithm is easily implementable in a relatively short amount of code. It can be used to convert an infix expression like those in your examples into postfix expressions, and evaluation of a postfix expression is Easy-with-a-capital-E (you don't strictly need to complete the infix-to-postfix conversion; you can evaluate the postfix output of the shunting yard directly and just accumulate the result as you go along).
It handles operator precedence, parentheses, and both unary and binary operators (and with a little effort can be modified to handle infix ternary operators, like the conditional operator in many languages).
Write it in yacc (bison) it becomes trivial.
/* Yeacc Code */
%token IDENTIFIER
%token LITERAL
%%
Expression: OrExpression
OrExpression: AndExpression
| OrExpression ',' AndExpression
AndExpression: NotExpression
| AndExpression '+' NotExpression
NotExpression: PrimaryExpression
| '!' NotExpression
PrimaryExpression: Identifier
| Literal
| '(' Expression ')'
Literal: LITERAL
Identifier: IDENTIFIER
%%
There's probably a better (there's definitely a more concise) description of this, but I learned how to do this from this tutorial many years ago:
http://compilers.iecc.com/crenshaw/
It's a very easy read for non-programmers too (like me). You'll need only the first few chapters.
From this wikipedia page:
The fundamental difference between
context-free grammars and parsing
expression grammars is that the PEG's
choice operator is ordered. If the
first alternative succeeds, the second
alternative is ignored. Thus ordered
choice is not commutative, unlike
unordered choice as in context-free
grammars and regular expressions.
Ordered choice is analogous to soft
cut operators available in some logic
programming languages.
Why does PEG's choice operator short circuits the matching? Is it because to minimize memory usage (due to memoization)?
I'm not sure what the choice operator is in regular expressions but let's suppose it is this: /[aeiou]/ to match a vowel. So this regex is commutative because I could have written it in any of the 5! (five factorial) permutations of the vowel characters? i.e. /[aeiou]/ behaves the same as /[eiaou]/. What is the advantage of it being commutative? (c.f. PEG's non-commutativity)
The consequence is that if a CFG is
transliterated directly to a PEG, any
ambiguity in the former is resolved by
deterministically picking one parse
tree from the possible parses. By
carefully choosing the order in which
the grammar alternatives are
specified, a programmer has a great
deal of control over which parse tree
is selected.
Is this saying that PEG's grammar is superior to CFG's?
A CFG grammar is non-deterministic, meaning that some input could result in two or more possible parse-trees. Though most CFG-based parser-generators have restrictions on the determinability of the grammar. It will give a warning or error if it has two or more choices.
A PEG grammar is deterministic, meaning that any input can only be parsed one way.
To take a classic example; The grammar
if_statement := "if" "(" expr ")" statement "else" statement
| "if" "(" expr ")" statement;
applied to the input
if (x1) if (x2) y1 else y2
could either be parsed as
if_statement(x1, if_statement(x2, y1, y2))
or
if_statement(x1, if_statement(x2, y1), y2)
A CFG-parser would generate a Shift/Reduce-conflict, since it can't decide if it should shift (read another token), or reduce (complete the node), when reaching the "else" keyword. Of course, there are ways to get around this problem.
A PEG-parser would always pick the first choice.
Which one is better is for you to decide. My opinion is that often PEG-grammars is easier to write, and CFG grammars easier to analyze.
I think you're confusing CFG with LR and with ambiguity. Grammars are not deterministic/nondeterministic, though their parsers may be. An ambiguous grammar is still CFG if it complies with the definition, and a deterministic parser can be built for it doing what PEG does.
PEGs and CFGs are two different ways of specifying a language. If you write a parser by hand, chances are very good that you will write a so-called recursive descent parser. A recursive descent parser will automatically resolve any ambiguities in your grammar, but does so silently and likely not in the way you would have wanted. The problem with this is that you never find out that there were ambiguities that got automatically resolved, unless you thoroughly test your parser. PEGs are basically a formalization of recursive descent parsers, and so come with this problem. For examples of this problem see How does backtracking affect the language recognized by a parser?, and https://cs.stackexchange.com/questions/143480/dragon-book-4-4-5-exercise/143975.
CFGs have a lot of theory to back them up, but PEGs not so much. The set of languages that can be encoded by CFG and those that can be encoded by PEG partially overlap, but neither encompasses the other.
For a more thorough review of this I recommend the excellent essay Which Parsing Approach?