Convert a DFA to rules for an asterisk case - regex

Here is a simple but very common grammar rule in EBNF format, where Statements and Statement are both nonterminal symbols:
Statements ::= (Statement ';')*
After converting this rule to an NFA, and then applying the subset construction to convert the NFA to a DFA, I finally get this DFA:
State0 -> Statement -> State1 -> ';' -> State0
State0 -> ε -> State0
State0 is the DFA's start state, representing the nonterminal symbol Statements; it is also an accepting state.
From State0, the input Statement transitions to State1, and from State1 the input ';' transitions back to State0.
Also, State0 can transition to itself on ε.
After converting the above DFA to a regular grammar following the algorithm in the Dragon Book, I get the following grammar rules:
Statements -> ε
Statements -> Statement Extend_NT
Extend_NT -> ';' Statements
It added the new nonterminal symbol Extend_NT, but I want to get the following regular grammar, which does not contain the extra symbol Extend_NT:
Statements -> ε
Statements -> Statement ';' Statements
So the question is: is there any algorithm that could produce the above result without introducing the new nonterminal symbol Extend_NT?
Or is it just an engineering problem?

I'm not really sure I understand your question.
If you just have a single production for a non-terminal, and that production has just a single repetition operator, either at the beginning or at the end, you can apply a simple desugaring. (Here α and β are sequences of grammar symbols, but not EBNF operators, and α might be empty.)
N ::= α (β)* ⇒ N → α | N β
N ::= (β)* α ⇒ N → α | β N
If α is empty, you can use either of the above. For an LALR(1) parser generator, the left-recursive version is the usual choice; if you're building an LL parser, you will of course want the right-recursive version.
If there is more than one EBNF operator on the right-hand side, you will need extra non-terminals: one for each EBNF repetition operator, except possibly one.
I don't know whether you'd call that an algorithm. I personally think of it as engineering, but the difference is not absolute.
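As a sanity check, the right-recursive desugaring can be exercised mechanically. This is a minimal Python sketch (my own illustration, not from the answer above): it applies the rule Statements -> Statement ';' Statements | ε repeatedly and confirms it yields exactly the strings denoted by (Statement ';')*.

```python
# Right-recursive desugaring of  Statements ::= (Statement ';')*
#   Statements -> ε
#   Statements -> Statement ';' Statements

def derive(n):
    """Apply the recursive rule n times, then the ε rule."""
    if n == 0:
        return []                                  # Statements -> ε
    return ["Statement", ";"] + derive(n - 1)      # Statements -> Statement ';' Statements

def ebnf(n):
    """The EBNF form (Statement ';')* with n repetitions."""
    return ["Statement", ";"] * n

# The two formulations generate the same sentences.
assert all(derive(n) == ebnf(n) for n in range(10))
print(derive(2))  # ['Statement', ';', 'Statement', ';']
```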

Related

Making an antlr4 parser rule that cannot have any skipped characters

I am trying to write an antlr4 grammar for a customized language which among its lexer rules originally contained the following:
PLUS : '+' ;
MINUS : '-' ;
NUMBER: ('+'|'-')? [0-9]+ ;
COMMENT : '/*' (COMMENT|~'*'|('*' ~'/'))* '*/' -> skip ;
WS : (' ' | '\t' | '\n') -> skip ;
The parser grammar contains, among other things, an arithmetic expression evaluator, and what I found is that these lexer rules failed to parse the input '2-2' correctly: it should come out as NUMBER MINUS NUMBER, but the lexer instead just returned two NUMBER tokens. I therefore broke the unary + and - applications out into their own parser rule, as follows:
literal_number : NUMBER
| '-' NUMBER
| '+' NUMBER ;
And defined NUMBER simply as:
NUMBER: [0-9]+ ;
However, with this arrangement, the literal_number parser rule is matched even when there is whitespace or a comment between the plus or minus token and the number itself. This should not be valid in parser contexts where I am expecting to see only an integer constant (which is actually anywhere other than when parsing arithmetic expressions). I have another parser rule elsewhere in my grammar that already handles unary negation, so I do not need to replicate that in literal_number anyway; all I want is for the literal_number rule to match only places in the text where a real integer constant occurs.
How can I do this? I have already looked at questions on stackoverflow pertaining to rules that are sensitive to whitespace, but I have not been able to figure out how to apply any of those solutions to my problem.
I'm not sure that this matters for my question, but my target language is C++, although I expect I may still be able to generalize from a Java-specific example if one is offered.
EDIT:
The response I've seen so far highlights an issue with my original question, which may have been ambiguous. In my defense, I had not wanted to complicate the question with information that I did not immediately see as relevant, but in light of that response, I can now clearly see that it is. I can only offer my apologies for this initial oversight.
In addition to the literal_number rule, I also have the following rule for expressions, which, in particular, has a rule allowing for negation.
expression : ID # look up value
| literal_number # number
| MINUS expression # negate
| expression (STAR|SLASH) expression # multiply
| expression (PLUS|MINUS) expression # add
;
So to that end, the expression 2-2 should evaluate as literal_number (2) MINUS literal_number (2), 2--2 should evaluate as literal_number (2) MINUS literal_number (-2), while 2-- 2 should evaluate as literal_number (2) MINUS MINUS literal_number (2).
So basically, as I said originally, I only want the literal_number rule to be used when the NUMBER is by itself, or when the MINUS and the NUMBER are side by side with no ignored tokens between them; but I cannot just make ('+'|'-')? [0-9]+ a lexer rule for NUMBER without causing the problem I had in the first place.
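For what it's worth, the mis-tokenization described above is a consequence of maximal-munch (longest-match) lexing, which ANTLR's lexer uses. Here is a hedged Python sketch (not ANTLR itself; the tokenize helper and rule tables are made up for illustration) that reproduces the difference between the two NUMBER definitions:

```python
import re

# Each candidate rule is tried at the current position and the longest
# match wins (maximal munch); ties go to the rule listed first, which
# mirrors how ANTLR's lexer disambiguates.
signed_rules = [("NUMBER", r"[+-]?[0-9]+"), ("PLUS", r"\+"), ("MINUS", r"-")]
plain_rules  = [("NUMBER", r"[0-9]+"),      ("PLUS", r"\+"), ("MINUS", r"-")]

def tokenize(text, rules):
    pos, tokens = 0, []
    while pos < len(text):
        # Pick the rule with the longest match at this position.
        name, lexeme = max(
            ((n, m.group()) for n, r in rules
             if (m := re.match(r, text[pos:]))),
            key=lambda t: len(t[1]),
        )
        tokens.append(name)
        pos += len(lexeme)
    return tokens

print(tokenize("2-2", signed_rules))  # ['NUMBER', 'NUMBER']
print(tokenize("2-2", plain_rules))   # ['NUMBER', 'MINUS', 'NUMBER']
```

With the signed rule, the lexer standing just after the first '2' sees that "-2" is a longer NUMBER match than the one-character MINUS, so no MINUS token is ever produced.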

How can I find regular grammar for L* if we are given a grammar for language L?

Is there any general method to do so? For example, we have a general method to find a grammar for L1 U L2: add a production S -> S1 | S2, where S1 and S2 are the start symbols of the grammars for L1 and L2 respectively.
Thanks in advance..
In general, given a grammar G such that L(G) = L', there is no algorithm that always produces a regular grammar G' such that L(G') = (L')*. For starters, (L')* may not be a regular language. Even if you allow the procedure to recognize this case and print "not a regular language", this is not generally possible, since it would allow us to decide whether arbitrary unrestricted grammars generate particular strings (the construction is not too hard, but I won't provide it unless desired). That is an undecidable problem, so we cannot recognize regular languages in unrestricted grammars.
Perhaps your question is whether there is a neat construction to do this if you are initially given a regular grammar. In that case, the answer is a definite and clear "yes!" Here is one easily described (though possibly inefficient in practice) procedure for doing just that:
1. Convert the regular grammar into a nondeterministic finite automaton using the typical construction for doing so. There are easy constructions for both left-regular and right-regular grammars.
2. Construct a regular expression from the nondeterministic finite automaton using any known construction. One such construction is typically used in proving equivalence.
3. Construct a new regular expression which is the Kleene closure of the one from the last step.
4. Construct a nondeterministic finite automaton from the regular expression from the last step, using a standard construction.
5. Construct a regular grammar from the nondeterministic finite automaton from the last step. There are known constructions for this.
Thus, we can mechanically go from regular grammar for L to regular grammar for L*.
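Step 1 of the procedure can be sketched concretely. The following Python sketch (my own illustration; grammar_to_nfa and accepts are made-up helper names) converts a right-regular grammar with productions of the form N -> xM or N -> x into an NFA and simulates it:

```python
# Assumes a right-regular grammar given as (head, body) pairs where body
# is either "xM" (terminal then nonterminal) or "x" (a lone terminal).
# Each nonterminal becomes a state; N -> x transitions to a single
# accepting state "H".

def grammar_to_nfa(productions, start):
    delta = {}   # (state, symbol) -> set of next states
    for head, body in productions:
        if len(body) == 2:                              # N -> x M
            delta.setdefault((head, body[0]), set()).add(body[1])
        else:                                           # N -> x
            delta.setdefault((head, body[0]), set()).add("H")
    return delta, start, {"H"}

def accepts(nfa, word):
    delta, start, finals = nfa
    states = {start}
    for ch in word:
        states = set().union(*(delta.get((q, ch), set()) for q in states))
    return bool(states & finals)

# The example grammar below: S := 0S | 1T, T := 0S | 1T | 1
g = [("S", "0S"), ("S", "1T"), ("T", "0S"), ("T", "1T"), ("T", "1")]
nfa = grammar_to_nfa(g, "S")
print(accepts(nfa, "11"), accepts(nfa, "10"))  # True False
```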
If you just want ANY grammar for L*, the simplest would probably be to introduce a new start symbol S' and productions S' := S'S' | S | ε, where S is the start symbol of your input grammar (the ε production is needed because the empty string is always in L*). This is obviously not a regular grammar; however, if the input grammar generates a regular language, this one will generate a regular language as well.
Example: given the regular grammar
S := 0S | 1T
T := 0S | 1T | 1
A construction gives us this nondeterministic finite automaton:
q s q'
- - -
S 0 S
S 1 T
T 0 S
T 1 T
T 1 (H)
A construction gives us the regular expression:
(0*1)(0*1)*1
The Kleene closure of this is:
((0*1)(0*1)*1)*
We recognize from the standard construction that this automaton is equivalent:
q s q'
- - -
(I) - S
S 0 S
S 1 T
T 0 S
T 1 T
T 1 H
H - (I)
Whence the following regular grammar:
I := S | ε
S := 0S | 1T
T := 0S | 1T | 1H
H := I
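As a quick sanity check of the regular expressions above (a sketch of my own, not part of the construction itself), Python's re module agrees with the claimed base language and its Kleene closure:

```python
import re

base    = r"(0*1)(0*1)*1"      # regex for L derived from the grammar
closure = r"((0*1)(0*1)*1)*"   # its Kleene closure, for L*

def in_L(s):
    return re.fullmatch(base, s) is not None

def in_Lstar(s):
    return re.fullmatch(closure, s) is not None

print(in_L("11"), in_L("1011"))        # True True
print(in_Lstar(""), in_Lstar("1111"))  # True True  (ε, and "11" + "11")
print(in_Lstar("10"))                  # False (every block must end in 1)
```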

What is ';;' for in OCaml?

This is my simple OCaml code to print out a merged list.
let rec merge cmp x y = match x, y with
| [], l -> l
| l, [] -> l
| hx::tx, hy::ty ->
if cmp hx hy
then hx :: merge cmp tx (hy :: ty)
else hy :: merge cmp (hx :: tx) ty
let rec print_list = function
| [] -> ()
| e::l -> print_int e ; print_string " " ; print_list l
;;
print_list (merge ( <= ) [1;2;3] [4;5;6]) ;;
I copied the print_list function from Print a List in OCaml, but I had to add ';;' after the function's implementation in order to avoid this error message:
File "merge.ml", line 11, characters 47-57:
Error: This function has type int list -> unit
It is applied to too many arguments; maybe you forgot a `;'.
My question is: why is ';;' needed for print_list, while it is not for merge?
The ;; is, in essence, a way of marking the end of an expression. It's not necessary in source code (though you can use it if you like). It is useful in the toplevel (the OCaml REPL) to cause the evaluation of what you've typed so far. Without such a symbol, the toplevel has no way to know whether you're going to type more of the expression later.
In your case, it marks the end of the expression representing the body of the function print_list. Without this marker, the following symbols look like they're part of the same expression, which leads to a type error.
For top-level expressions in OCaml files, I prefer to write something like the following:
let () = print_list (merge ( <= ) [1;2;3] [4;5;6])
If you code this way you don't need to use the ;; token in your source files.
This is an expansion of Jeffrey's answer.
As you know, when doing language parsing, a program has to break the input into manageable lexical elements, and expects these so-called lexemes (or tokens) to follow certain syntactic rules that allow lexemes to be regrouped into larger units of meaning.
In many languages, the largest syntactic element is the statement, which subdivides into instructions and definitions. In these same languages, the structure of the grammar requires a special lexeme to indicate the end of some of these units, usually either instructions or statements. Other languages instead use a lexeme to separate instructions (rather than terminate each of them), but it's basically the same idea.
In OCaml, the grammar follows patterns which, within the algorithm used to parse the language, permit eliding such an instruction terminator (or separator) in various circumstances. The keyword let, for instance, is always necessary to introduce a new definition. This can be used to detect the end of the preceding statement at the outermost level of program statements (the top level).
However, you can easily see the problem this induces: the interactive version of OCaml would always need the beginning of a new statement to figure out where the previous one ends, and a user would never be able to provide statements one by one! The special token ;; was thus defined(*) to indicate the end of a top-level statement.
(*): I actually seem to recall that the token was required in OCaml's ancestors, but was then made optional when OCaml was devised; however, don't quote me on that.

Regular expression for a grammar

I'm reading about finite automata & grammars in the compiler construction book by Aho, and I've been stuck on this grammar for a long time. I don't have a clear perception of how to describe it:
Consider the following Grammar:
S -> (L) | a
L -> L, S | S
Note that the parentheses and comma are actually terminals in this
language and appear in the sentences accepted by this grammar. Try to
describe the language generated by this grammar. Is this grammar
ambiguous?
My concern here is: can the language generated by this grammar be described by a regular expression? I'm confused about how to do it. Any help?
To show that the grammar is ambiguous, you need to be able to construct two different parse trees for the same string. Your string will be composed of "(", ")", ",", and "a", since those are the only terminal symbols in the grammar.
Try arranging those 4 terminal symbols in a few ways and see if you can show different successful parses, in the spirit of the example ambiguous grammar on Wikipedia.
Immediate left recursion tends to cause problems for some parsers. See if "a,a,a" does anything interesting on "L → L , S | S"...
my concern here is: can the language generated by this grammar be described as a regular expression... I'm confused about how to do it
A regular expression can not fully describe the grammar. Rewriting part of the grammar will make this more apparent:
1. S → ( L )
2. S → a
3. L → L , S
4. L → S
Pay attention to #1 and #4. L can produce S, and S can produce ( L ). This means S can produce ( S ), which can produce ( ( S ) ), ( ( ( S ) ) ), etc., ad infinitum. The key thing is that those parentheses are matched: there are the same number of "(" symbols as ")" symbols.
A regex can't do that.
Regular expressions map to finite automata, and finite automata cannot count. The language L = {0^n 1^n : n ≥ 0} is not regular, and {(^n )^n : n ≥ 0}, obtained by substituting "(" for "0" and ")" for "1", isn't either. See the first examples section under Regular Languages - Wikipedia. (Notation note: s^1 is s, s^2 is ss, ..., s^n is s repeated n times.)
This means you can't use a regex to describe that part of the language. That puts it in the domain of CFGs, Turing Machines, and pushdown automata.
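To see concretely what a pushdown-style recognizer buys you, here is a minimal recursive-descent sketch (my own illustration) for this grammar, after removing the left recursion by rewriting L -> L , S | S as L -> S (',' S)*. The recursion in parse_S is what handles the arbitrarily deep matched parentheses that a regex cannot:

```python
# Recognizer for  S -> (L) | a  and  L -> S (',' S)*  (left recursion removed).

def parse_S(s, i):
    """Return the index just past one S starting at i, or None."""
    if i < len(s) and s[i] == "a":
        return i + 1
    if i < len(s) and s[i] == "(":
        j = parse_L(s, i + 1)
        if j is not None and j < len(s) and s[j] == ")":
            return j + 1
    return None

def parse_L(s, i):
    """Parse S (',' S)* starting at i."""
    j = parse_S(s, i)
    if j is None:
        return None
    while j < len(s) and s[j] == ",":
        k = parse_S(s, j + 1)
        if k is None:
            return None
        j = k
    return j

def accepts(s):
    return parse_S(s, 0) == len(s)

print(accepts("a"), accepts("(a,a)"), accepts("((a,a),a)"))  # True True True
print(accepts("a,a"), accepts("(a"))                          # False False
```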
Regular expressions (and a library to interpret them) are a poor tool for recognizing sentences of a context-free grammar. Instead, you would want to use a parser generator like yacc, bison, or ANTLR.
I think the point of the exercise in Aho's book is to "describe the language" in words, in order to understand whether it is ambiguous. One way to approach it: can you devise a grammatical sentence that can be parsed in two different ways, given the productions of the grammar? If so, the grammar is ambiguous.

Is this context-free grammar a regular expression?

I have a grammar defined as follows:
A -> aA*b | empty_string
Is A a regular expression? I'm confused as to how to interpret a BNF grammar.
No, this question doesn't actually have to do with regular expressions. This is a context-free grammar, and context-free grammars can specify languages that regular expressions cannot describe.
Here, A is a non-terminal; that is, it's a symbol that must be expanded by a production rule. Given that it is the only non-terminal, it is also your start symbol: any derivation in this grammar must start from A.
The production rules are:
(1) A -> aA*b
(2) A -> empty_string
a and b are terminal symbols: they are in the alphabet of the language and cannot be expanded. When no nonterminals are left in the derived string, you are done.
So this language specifies words that are like balanced parentheses, except with a and b instead of ( and ).
For instance, you could produce ab as follows:
A -> aAb (using 1, expanding the star to one A)
aAb -> ab (using 2)
Similarly, you could produce aabb:
A -> aAb (1)
aAb -> aaAbb (1)
aaAbb -> aabb (2)
Or even aababb:
A -> aAAb (using 1, expanding the star to two As)
aAAb -> aabAb (using 1, with zero repetitions)
aabAb -> aababb (using 1, with zero repetitions)
Get it? The star symbol may be a bit confusing because you have seen it in regular expressions, but it actually does the same thing here as there. It is called a Kleene closure, and it represents all the words you can make with zero or more As.
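To make the derivations above concrete, here is a small Python sketch (my own illustration, not from the original question) of a recognizer for A -> a A* b | empty_string. Because every inner A must end with its own matching b, a greedy parse works without backtracking:

```python
# Recognizer for  A -> a A* b | empty_string.

def parse_A(s, i):
    """Consume one A starting at index i; return the new index."""
    if i < len(s) and s[i] == "a":
        j = i + 1
        while True:                 # A*: zero or more inner As
            k = parse_A(s, j)
            if k == j:              # inner A matched only ε; stop
                break
            j = k
        if j < len(s) and s[j] == "b":
            return j + 1
        return i                    # no closing b: fall back to A -> ε
    return i                        # A -> empty_string

def accepts(s):
    return parse_A(s, 0) == len(s)

print(accepts(""), accepts("ab"), accepts("aababb"))  # True True True
print(accepts("abab"), accepts("ba"))                 # False False
```

Note that "abab" is rejected: a nonempty word of this grammar is a single a ... b pair enclosing smaller such words, so a top-level concatenation of two pairs is not derivable from A alone.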
Regular Expressions generate Regular languages and can be parsed with State Machines.
BNF grammars are context-free grammars, which generate context-free languages and can be parsed with pushdown automata (stack machines).
Context Free Grammars can do everything Regular Grammars can and more.
A appears to be a BNF grammar rule. I'm not really sure why you have this confused with a regular expression. Are you confused because it has a * in it? Not everything that has a * is a regular expression.