Regex BNF Grammar - regex

Is there any BNF grammar for regular expression?

You can see one for Perl regexp (displayed a little more in detail here, as posted by edg)

To post them on-site:
CMPT 384 Lecture Notes Robert D. Cameron November 29 - December 1,
1999
BNF Grammar of Regular Expressions
Following the precedence rules given previously, a BNF grammar for Perl-style regular expressions can be constructed as follows.
<RE> ::= <union> | <simple-RE>
<union> ::= <RE> "|" <simple-RE>
<simple-RE> ::= <concatenation> | <basic-RE>
<concatenation> ::= <simple-RE> <basic-RE>
<basic-RE> ::= <star> | <plus> | <elementary-RE>
<star> ::= <elementary-RE> "*"
<plus> ::= <elementary-RE> "+"
<elementary-RE> ::= <group> | <any> | <eos> | <char> | <set>
<group> ::= "(" <RE> ")"
<any> ::= "."
<eos> ::= "$"
<char> ::= any non metacharacter | "\" metacharacter
<set> ::= <positive-set> | <negative-set>
<positive-set> ::= "[" <set-items> "]"
<negative-set> ::= "[^" <set-items> "]"
<set-items> ::= <set-item> | <set-item> <set-items>
<set-items> ::= <range> | <char>
<range> ::= <char> "-" <char>
via VonC.
--- Knud van Eeden --- 21 October 2003 - 03:22 am --------------------
PERL:Search/Replace:Regular Expression:Backus Naur Form:What is
possible BNF for regular expression?
expression = term
term | expression
term = factor
factor term
factor = atom
atom metacharacter
atom = character
.
( expression )
[ characterclass ]
[ ^ characterclass ]
{ min }
{ min , }
{ min , max }
characterclass = characterrange
characterrange characterclass
characterrange = begincharacter
begincharacter - endcharacter
begincharacter = character
endcharacter = character
character =
anycharacterexceptmetacharacters
\ anycharacterexceptspecialcharacters
metacharacter = ?
* {=0 or more, greedy}
*? {=0 or more, non-greedy}
+ {=1 or more, greedy}
+? {=1 or more, non-greedy}
^ {=begin of line character}
$ {=end of line character}
$` {=the characters to the left of the match}
$' {=the characters to the right of the match}
$& {=the characters that are matched}
\t {=tab character}
\n {=newline character}
\r {=carriage return character}
\f {=form feed character}
\cX {=control character CTRL-X}
\N {=the characters in Nth tag (if on match side)}
$N{=the characters in Nth tag (if not on match side)}
\NNN {=octal code for character NNN}
\b {=match a 'word' boundary}
\B {=match not a 'word' boundary}
\d {=a digit, [0-9]}
\D {=not a digit, [^0-9]}
\s {=whitespace, [ \t\n\r\f]}
\S {=not a whitespace, [^ \t\n\r\f]}
\w {='word' character, [a-zA-Z0-9_]}
\W {=not a 'word' character, [^a-zA-Z0-9_]}
\Q {=put a quote (de-meta) on characters, until \E}
\U {=change characters to uppercase, until \E}
\L {=change characters to uppercase, until \E}
min = integer
max = integer
integer = digit
digit integer
anycharacter = ! " # $ % & ' ( ) * + , - . / :
; < = > ? # [ \ ] ^ _ ` { | } ~
0 1 2 3 4 5 6 7 8 9
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
---
[book: see also: Bucknall, Julian - the Tomes of Delphi: Algorithms
and Datastructures - p. 37 - 'Using regular expressions' -
http://www.amazon.com/exec/obidos/tg/detail/-
/1556227361/qid=1065748783/sr=1-1/ref=sr_1_1/002-0122962-7851254?
v=glance&s=books]
---
---
Internet: see also:
---
Compiler: Grammar: Expression: Regular: Which grammar defines set of
all regular expressions? [BNF]
http://www.faqts.com/knowledge_base/view.phtml/aid/25950/fid/1263
---
Perl Regular Expression: Quick Reference 1.05
http://www.erudil.com/preqr.pdf
---
Top: Computers: Programming: Languages: Regular Expressions: Perl
http://dmoz.org/Computers/Programming/Languages/Regular_Expressions/Per
l/
---
TSE: Search/Replace:Regular Expression:Backus Naur Form:What is
possible BNF for regular expression?
http://www.faqts.com/knowledge_base/view.phtml/aid/25714/fid/1236
---
Delphi: Search: Regular expression: Create: How to create a regular
expression parser in Delphi?
http://www.faqts.com/knowledge_base/view.phtml/aid/25645/fid/175
---
Delphi: Search: Regular expression: How to add regular expression
searching to Delphi? [Systools]
http://www.faqts.com/knowledge_base/view.phtml/aid/25295/fid/175
----------------------------------------------------------------------
via Ed Guinness.

http://web.archive.org/web/20090129224504/http://faqts.com/knowledge_base/view.phtml/aid/25718/fid/200

Related

capture in a string everything that is not a token

Context: I am dealing with a mix of boolean and arithmetic expressions that may look like in the following example:
b_1 /\ (0 <= x_1) /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))))
I want to match and extract any constraint of the shape A <= B contained in the formula which must be always true. In the above example, only 0 <= x_1 would satisfy such criterion.
Current Goal:
My idea is to build a simple parse tree of the input formula focusing only on the following tokens: and (/\), or (\/), left bracket (() and right bracket ()). Given the above formula, I would like to generate the following AST:
/\
|_ "b_1"
|_ /\
|_ "0 <= x_1"
|_ \/
|_ "x_2 <= 2"
|_ /\
|_ "b_3"
|_ "(/ 1 3) <= x_4"
Then, I can simply walk through the AST and discard any sub-tree rooted at \/.
My Attempt:
Looking at this documentation, I am defining the grammar for the lexer as follows:
import ply.lex as lex
tokens = (
"LPAREN",
"RPAREN",
"AND",
"OR",
"STRING",
)
t_AND = r'\/\\'
t_OR = r'\\\/'
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_ignore = ' \t\n'
def t_error(t):
print(t)
print("Illegal character '{}'".format(t.value[0]))
t.lexer.skip(1)
def t_STRING(t):
r'^(?!\)|\(| |\t|\n|\\\/|\/\\)'
t.value = t
return t
data = "b_1 /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))"
lexer = lex.lex()
lexer.input(data)
while True:
tok = lexer.token()
if not tok:
break
print(tok.type, tok.value, tok.lineno, tok.lexpos)
However, I get the following output:
~$ python3 lex.py
LexToken(error,'b_1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,0)
Illegal character 'b'
LexToken(error,'_1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,1)
Illegal character '_'
LexToken(error,'1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,2)
Illegal character '1'
AND /\ 1 4
LPAREN ( 1 7
LexToken(error,'x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,8)
Illegal character 'x'
LexToken(error,'_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,9)
Illegal character '_'
LexToken(error,'2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,10)
Illegal character '2'
LexToken(error,'<= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,12)
Illegal character '<'
LexToken(error,'= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,13)
Illegal character '='
LexToken(error,'2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,15)
Illegal character '2'
OR \/ 1 17
LPAREN ( 1 20
LexToken(error,'b_3 /\\ ((/ 1 3) <= x_4))',1,21)
Illegal character 'b'
LexToken(error,'_3 /\\ ((/ 1 3) <= x_4))',1,22)
Illegal character '_'
LexToken(error,'3 /\\ ((/ 1 3) <= x_4))',1,23)
Illegal character '3'
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
LexToken(error,'/ 1 3) <= x_4))',1,30)
Illegal character '/'
LexToken(error,'1 3) <= x_4))',1,32)
Illegal character '1'
LexToken(error,'3) <= x_4))',1,34)
Illegal character '3'
RPAREN ) 1 35
LexToken(error,'<= x_4))',1,37)
Illegal character '<'
LexToken(error,'= x_4))',1,38)
Illegal character '='
LexToken(error,'x_4))',1,40)
Illegal character 'x'
LexToken(error,'_4))',1,41)
Illegal character '_'
LexToken(error,'4))',1,42)
Illegal character '4'
RPAREN ) 1 43
RPAREN ) 1 44
The t_STRING token is not correctly recognized as it should.
Question: how to set the catch all regular expression for t_STRING so as to get a working tokenizer?
Your regular expression for T_STRING most certainly doesn't do what you want. What it does do is a little more difficult to answer.
In principle, it consists only of two zero-length assertions: ^, which is only true at the beginning of the string (unless you provide the re.MULTILINE flag, which you don't), and a long negative lookahead assertion.
A pattern which consists only of zero-length assertions can only match the empty string, if it matches anything at all. But lexer patterns cannot be allowed to match the empty string. Lexers divide the input into a series of tokens, so that every character in the input belongs to some token. Each match -- and they are all matches, not searches -- starts precisely at the end of the previous match. So if a pattern could match the empty string, the lexer would try the next match at the same place, with the same result, which would be an endless loop.
Some lexer generators solve this problem by forcing a minimum one-character match using a built-in catch-all error pattern, but Ply simply refuses to generate a lexer if a pattern matches the empty string. Yet Ply does not complain about this lexer specification. The only possible explanation is that the pattern cannot match anything.
The key is that Ply compiles all patterns using the re.VERBOSE flag, which allows you to separate items in regular expressions with whitespace, making the regexes slightly less unreadable. As the Python documentation indicates:
Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>.
Whitespace includes newlines and even comments (starting with a # character), so you can split patterns over several lines and insert comments about each piece.
We could do that, in fact, with your pattern:
def t_STRING(t):
r'''^ # Anchor this match at the beginning of the input
(?! # Don't match if the next characters match:
\) | # Close parenthesis
\( | # Open parenthesis
\ | # !!! HERE IS THE PROBLEM
\t | # Tab character
\n | # Newline character
\\\/ | # \/ token
\/\\ # /\ token
)
'''
t.value = t
return t
So as I added whitespace and comments to your pattern, I had to notice that the original pattern attempted to match a space character as an alternative with | |. But since the pattern is compiled as re.VERBOSE, that space character is ignored, leaving an empty alternative, which matches the empty string. That alternative is part of a negative lookahead assertion, which means that the assertion will fail if the string to match at that point starts with the empty string. Of course, every string starts with the empty string, so the negative lookahead assertion always fails, explaining why Ply didn't complain (and why the pattern never matches anything).
Regardless of that particular glitch, the pattern cannot be useful because, as mentioned already, a lexer pattern must match some characters, and so a pattern which only matches the empty string cannot be useful. What we want to do is match any character, providing that the negative lookahead (corrected, as below) allows it. So that means that the negative lookahead assertion show be followed with ., which will match the next character.
But you almost certainly don't want to match just one character. Presumably you wanted to match a string of characters which don't match any other token. So that means putting the negative lookahead assertion and the following . into a repetition. And remember that it needs to be a non-empty repetition (+, not *), because patterns must not have empty matches.
Finally, there is absolutely no point using an anchor assertion, because that would limit the pattern to matching only at the beginning of the input, and that is certainly not what you want. It's not at all clear what it is doing there. (I've seen recommendations which suggest using an anchor with a negative lookahead search, which I think are generally misguided, but that discussion is out of scope for this question.)
And before we write the pattern, let's make one more adjustment: in a Python regular expression, if you can replace a set of alternatives with a character class, you should do so because it is a lot more efficient. That's true even if only some of the alternatives can be replaced.
So that produces the following:
def t_STRING(t):
r'''(
(?! # Don't match if the next characters match:
[() \t\n] | # Parentheses or whitespace
\\\/ | # \/ token
\/\\ # /\ token
) . # If none of the above match, accept a character
)+ # and repeat as many times as possible (at least once)
'''
return t
I removed t.value = t. t is a token object, not a string, and the value should be the string it matched. If you overwrite the value with a circular reference, you won't be able to figure out which string was matched.
This works, but not quite in the way you intended. Since whitespace characters are excluded from T_STRING, you don't get a single token representing (/ 1 3) <= x_4. Instead, you get a series of tokens:
STRING b_1 1 0
AND /\ 1 4
LPAREN ( 1 7
STRING x_2 1 8
STRING <= 1 12
STRING 2 1 15
OR \/ 1 17
LPAREN ( 1 20
STRING b_3 1 21
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
STRING / 1 30
STRING 1 1 32
STRING 3 1 34
RPAREN ) 1 35
STRING <= 1 37
STRING x_4 1 40
RPAREN ) 1 43
RPAREN ) 1 44
But I think that's reasonable. How could the lexer be able to tell that the parentheses in (x_2 <= 2 and (b_3 are parenthesis tokens, while the parentheses in (/ 1 3) <= x_4 are part of T_STRING? That determination will need to be made in your parser.
In fact, my inclination would be to fully tokenise the input, even if you don't (yet) require a complete tokenisation. As this entire question and answer shows, attempting to recognised "everything but..." can actually be a lot more complicated than just recognising all tokens. Trying to get the tokeniser to figure out which tokens are useful and which ones aren't is often more difficult than tokenising everything and passing it through a parser.
Based on the excellent answer from #rici, pointing the problem with t_STRING, this is my final version of the example that introduces smaller changes to the one proposed by #rici.
Code
##############
# TOKENIZING #
##############
tokens = (
"LPAREN",
"RPAREN",
"AND",
"OR",
"STRING",
)
def t_AND(t):
r'[ ]*\/\\[ ]*'
t.value = "/\\"
return t
def t_OR(t):
r'[ ]*\\\/[ ]*'
t.value = "\\/"
return t
def t_LPAREN(t):
r'[ ]*\([ ]*'
t.value = "("
return t
def t_RPAREN(t):
r'[ ]*\)[ ]*'
t.value = ")"
return t
def t_STRING(t):
r'''(
(?! # Don't match if the next characters match:
[()\t\n] | # Parentheses or whitespace
\\\/ | # \/ token
\/\\ # /\ token
) . # If none of the above match, accept a character
)+ # and repeat as many times as possible (at least once)
'''
return t
def t_error(t):
print("error: " + str(t.value[0]))
t.lexer.skip(1)
import ply.lex as lex
lexer = lex.lex()
data = "b_b /\\ (ccc <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))"
lexer.input(data)
while True:
tok = lexer.token()
if not tok:
break
print("{0}: `{1}`".format(tok.type, tok.value))
Output
STRING: `b_b `
AND: `/\`
LPAREN: `(`
STRING: `ccc <= 2 `
OR: `\/`
LPAREN: `(`
STRING: `b_3 `
AND: `/\`
LPAREN: `(`
LPAREN: `(`
STRING: `/ 1 3`
RPAREN: `)`
STRING: `<= x_4`
RPAREN: `)`
RPAREN: `)`

Reluctant matching in ANTLR 4.4

Just as the reluctant quantifiers work in Regular expressions I'm trying to parse two different tokens from my input i.e, for operand1 and operator. And my operator token should be reluctantly matched instead of greedily matching input tokens for operand1.
Example,
Input:
Active Indicator in ("A", "D", "S")
(To simplify I have removed the code relevant for operand2)
Expected operand1:
Active Indicator
Expected operator:
in
Actual output for operand1:
Active indicator in
and none for the operator rule.
Below is my grammar code:
grammar Test;
condition: leftOperand WHITESPACE* operator;
leftOperand: ALPHA_NUMERIC_WS ;
operator: EQUALS | NOT_EQUALS | IN | NOT_IN;
EQUALS : '=';
NOT_EQUALS : '!=';
IN : 'in';
NOT_IN : 'not' WHITESPACE 'in';
WORD: (LOWERCASE | UPPERCASE )+ ;
ALPHA_NUMERIC_WS: WORD ( WORD| DIGIT | WHITESPACE )* ( WORD | DIGIT)+ ;
WHITESPACE : (' ' | '\t')+;
fragment DIGIT: '0'..'9' ;
LOWERCASE : [a-z] ;
UPPERCASE : [A-Z] ;
One solution to this would be to not produce one token for several words but one token per word instead.
Your grammar would then look like this:
grammar Test;
condition: leftOperand operator;
leftOperand: ALPHA_NUMERIC+ ;
operator: EQUALS | NOT_EQUALS | IN | NOT_IN;
EQUALS : '=';
NOT_EQUALS : '!=';
IN : 'in';
NOT_IN : 'not' WHITESPACE 'in';
WORD: (LOWERCASE | UPPERCASE )+ ;
ALPHA_NUMERIC: WORD ( WORD| DIGIT)* ;
WHITESPACE : (' ' | '\t')+ -> skip; // ignoring WS completely
fragment DIGIT: '0'..'9' ;
LOWERCASE : [a-z] ;
UPPERCASE : [A-Z] ;
Like this the lexer will not match the whole input as ALPHA_NUMERIC_WS once the corresponding lexer rule has been entered because any occuring WS forces the lexer to leave the ALPHA_NUMERIC rule. Therefore any following input will be given a chance to be matched by other lexer-rules (in the order they are defined in the grammar).

Parse arbitrary delimiter character using Antlr4

I try to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar as in Perl). How can I achieve this?
To be clear: My problem is not the regular expression itself (which I actually do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily define the following rules to the lexer:
REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;
This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).
If I had to solve this using a regular expression itself, I would use a backreference like this:
m(.)(.+?)\1
(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.
It would be even better when I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all different variants.
Can this be solved with Antlr4?
You could do something like this:
lexer grammar TLexer;
REGEX
: REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
| '{' REGEX_ATOM+ '}'
| '(' REGEX_ATOM+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [/~##]
;
fragment REGEX_ATOM
: '\\' .
| ~[\\]
;
If you run the following class:
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));
for (Token t : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
}
}
}
you will see the following output:
REGEX /foo/
ANY
ANY /
ANY b
ANY a
ANY r
ANY \
ANY
REGEX ~\~~
ANY
REGEX {mu}
ANY
ANY (
ANY b
ANY l
ANY a
ANY (
The {...}? is called a predicate:
Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?
The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter matched by the first chararcter (which is a REGEX_DELIMITER, of course).
Tested with ANTLR 4.5.3
EDIT
And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):
lexer grammar TLexer;
#lexer::members {
boolean delimiterAhead(String start) {
return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
}
}
REGEX
: '/' ( '\\' . | ~[/\\] )+ '/'
| 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
| 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
| 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [~##]
;
fragment SPACES
: [ \t]+
;

Changing how Antlr4 prints a tree

I am using Antlr4 in IntelliJ to make a small compiler for arithmetic expressions.
I want to print the tree and use this code snippet to do so.
JFrame frame = new JFrame("Tree");
JPanel panel = new JPanel();
TreeViewer viewr = new TreeViewer(Arrays.asList(
parser.getRuleNames()),tree);
viewr.setScale(2);//scale a little
panel.add(viewr);
frame.add(panel);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.setSize(400,400);
frame.setVisible(true);
This makes a tree which looks like this, for the input 3*5\n
Is there a way to adjust this so it reads from top to bottom
Statement
Expression /n
INT * INT
3 5
instead?
My grammar is defined as:
grammar Expression;
statement: expression ENDSTATEMENT # printExpr
| ID '=' expression ENDSTATEMENT # assign
| ENDSTATEMENT # blank
;
expression: expression MULTDIV expression # MulDiv
| expression ADDSUB expression # AddSub
| INT # int
| FLOAT # float
| ID # id
| '(' expression ')' # parens
;
ID : [a-zA-Z]+ ; // match identifiers
INT : [0-9]+ ; // match integers
MULTDIV : ('*' | '/'); //match multiply or divide
ADDSUB : ('+' | '-'); //match add or subtract
FLOAT: INT '.' INT; //match a floating point number
ENDSTATEMENT:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WHITESPACE : [ \t]+ -> skip ; // ignore whitespace
No, not without changing the source of the tree viewer yourself.

Context-free-grammar to represent regular expressions

I'm trying to make a context-free-grammar to represent simple regular expressions. The symbols that I want is [0-9][a-z][A-Z], and operators is "|", "()" and "." for concatenation, and for sequences for now I only want "*" later I will add "+","?", etc. I tried this grammar in javacc:
void RE(): {}
{
FINAL(0) ( "." FINAL(0) | "|" FINAL(0))*
}
void FINAL(int sign): { Token t; }
{
t = <SYMBOL> {
if ( sign == 1 )
jjtThis.val = t.image + "*";
else
jjtThis.val = t.image;
}
| FINAL(1) "*"
| "(" RE() ")"
}
The problem is in FINAL function the line | FINAL(1) "*" that gives me a error Left recursion detected: "FINAL... --> FINAL.... Putting "*" on the left of FINAL(1) resolve the problem but this is not what I want..
I already tried to read the article from wikipedia to remove left recursion but I really don't know how to do it, can someone help? :s
The following takes care of the left recursion
RE --> FACTOR ("." FINAL | "|" FINAL)*
FINAL --> PRIMARY ( "*" )*
PRIMARY --> <SYMBOL> | "(" RE ")"
However, that won't give . precedence over | . For that you can do the following
RE --> TERM ("|" TERM)*
TERM --> FINAL ("." FINAL)*
FINAL --> PRIMARY ( "*" )*
PRIMARY --> <SYMBOL> | "(" RE ")"
The general rule is
A --> A b | c | d | ...
can be transformed to
A --> B b*
B --> c | d | ...
where B is a new nonnterminal.