i get project from
https://github.com/cchuahuico/COOL-Compiler
when compile this code:
class Main inherits SuperClass {
main(): Object {{
out_string("Enter an integer greater-than or equal-to 0: ");
let input: Int in
if input < 0 then
input
-- out_string("ERROR: Number must be greater-than or equal-to 0\n")
else {
-- out_string("The factorial of ").out_int(input);
-- out_string(" is ").out_int(factorial(input));
factorial(input);
}
fi;
}};
factorial(num: Int): Int {
if num = 0 then 1 else num * factorial(num - 1) fi
};
};
class SuperClass {
out_string(str:String){};
};
when compile this with mingw i have this error
<stdin>:2:error:syntax error near or aat character or token '{'
<stdin>:5:error:syntax error near or aat character or token 'let'
<stdin>:14:error:syntax error near or aat character or token 'fi'
<stdin>:15:error:syntax error near or aat character or token '}'
<stdin>:23:error:syntax error near or aat character or token '('
<stdin>:24:error:syntax error near or aat character or token '}'
<stdin>:24:error:syntax error near or aat character or token ' '
copmilation halted due to lexical or syntax errors
The mingw compiler is complaining because the language is not standard C++:
The inherits keyword. Although, it could be #define as
anything.
fi is not a keyword.
The if statement needs parens () around the expression.
let is not a keyword.
then is not a keyword.
// There are a lot more
Is this another language?
Related
I am working on a Regx parser for RegEx inside XSD.
My previous problem was descrived here: ANTLR4 parsing RegEx
I have split the Lexer and Parser since than.
Now I have a problem parsing parantheses inside brackets. They should be treated as characters inside the brackets and as grouping tokens outside.
This is my lexer grammar:
lexer grammar RegExLexer;
Char : ALPHA ;
Int : DIGIT ;
LBrack : '[' ;//-> pushMode(modeRange) ;
RBrack : ']' ;//-> popMode ;
LBrace : '(' ;
RBrace : ')' ;
Semi : ';' ;
Comma : ',' ;
Asterisk: '*' ;
Plus : '+' ;
Dot : '.' ;
Dash : '-' ;
Question: '?' ;
LCBrace : '{' ;
RCBrace : '}' ;
Pipe : '|' ;
Esc : '\\' ;
WS : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;
And here is the example:
[0-9a-z()]+
I feel like i should use modes on brackets to change the behaviour of ALPHA fragment. If I copy the fragment, I get an error saying I can't have the declaration twice.
I have read the reference about this and I still don't get what i should do.
How do I implement the modes?
Here's a quick demo of how it is possible to create a context sensitive lexer using ANTLR4's lexical-modes:
lexer grammar RegexLexer;
START_CHAR_CLASS
: '[' -> pushMode(CharClass)
;
START_GROUP
: '('
;
END_GROUP
: ')'
;
PLAIN_ATOM
: ~[()\[\]]
;
mode CharClass;
END_CHAR_CLASS
: ']' -> popMode
;
CHAR_CLASS_ATOM
: ~[\r\n\\\]]
| '\\' .
;
After generating the lexer, you can use the following class to test it:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
public class Main {
public static void main(String[] args) {
RegexLexer lexer = new RegexLexer(new ANTLRInputStream("([()\\]])"));
for (Token token : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", RegexLexer.VOCABULARY.getSymbolicName(token.getType()), token.getText());
}
}
}
And if you run this Main class, the follwoing will be printed to your console:
START_GROUP (
START_CHAR_CLASS [
CHAR_CLASS_ATOM (
CHAR_CLASS_ATOM )
CHAR_CLASS_ATOM \]
END_CHAR_CLASS ]
END_GROUP )
As you can see, the ( and ) are tokenized differently outside the character class as they are inside of it.
You're going to have to handle this in the parser, not the lexer. When lexer sees a '(', it will return token LBrace. For lexer, there is no context as to where token is seen. It simply carves up the input into tokens. You will have to define parse rules and when processing parse tree, you can then determine was the LBrace inside brackets or not.
I have this code that has an error and can't find it.
if (Lista.at(i).getStartHour() <= temp->getStartHour() &&
Lista.at(i).getEndHour() => temp->getEndHour() &&
Lista.at(i).getStartMinute() < temp->getStartMinute() &&
Lista.at(i).getEndMinute() > temp->getEndMinute())
I get this error:
error: expected primary-expression before '>' token at that line.
I can't see what I'm doing wrong.
Lista is a vector of objects, same object as temp. All functions return int. I'm trying to check if those times overlap.
=> is not a token; it's two tokens, = and >.
The greater-than-or-equal-to operator is >=.
this is my bison code:
%}
%union {
int int_val;
}
%left '+' '-'
%nonassoc '(' ')'
%token INTEGER PRINT
%type <int_val> expr_int INTEGER
%%
program: command '\n' { return 0; }
;
command: print_expr
;
print_expr: PRINT expr_int { cout<<$2<<endl; }
expr_int: expr_int '+' expr_int { $$ = $1 + $3; }
| expr_int '-' expr_int { $$ = $1 - $3; }
| '(' expr_int ')' { $$ = $2; }
| INTEGER
;
and this is the flex code:
%}
INTEGER [1-9][0-9]*|0
BINARY [-+]
WS [ \t]+
BRACKET [\(\)]
%%
print{WS} { return PRINT; }
{INTEGER} { yylval.int_val=atoi(yytext); return INTEGER; }
{BINARY}|\n { return *yytext; }
{BRACKET} { return *yytext; }
{WS} {}
. { return *yytext; }
%%
//////////////////////////////////////////////////
int yywrap(void) { return 1; } // Callback at end of file
Invalid inputs for the program are:
print 5
output:
5
input:
print (1+1)
output:
2
But for some reason, for the following inputs I do not get immediate error:
print (1+1))
output:
2
some error
input:
print 5!
output:
5
some error
I would like an error to be printed immediately, without commiting the print command and then throwing an error.
How should I change the program so it will not print errornous inputs?
Download the "flex & bison" book by John Levine or the "bison" manual from gnu. Both contain an infix calculator that you can reference.
The grammar you have written " '(' expr_int ')'" reduces to expr_int before the grammatically incorrect ')' in "(1 + 1))' is detected. That is the parser does:
(1 + 1)) => ( expr_int )) => expr_int)
and then sees the error. In order to capture the error you have to change the parser to see the error before the reduction, and you have to do it for all errors that you want treated. Therefore you would write (in this case):
expr_int '(' expr_int ')' ')' { this is an error message }
The short answer, after the long answer, is that it is impractical to generate a parser containing instances of all possible errors. What you have is fine for what you are doing. What you should explore is how to (gracefully) recover from an error rather than abandoning parsing.
Your "program" and "command" non-terminals can be combined as:
program: print-expr '\n' { return 0; }
On a separate note, your regular expressions can be rewritten to good effect as:
%%
INTEGER [0-9]+
WS [ \t]+
%%
print/{WS} { return PRINT; }
{INTEGER} { yylval.int_val=atoi(yytext); return INTEGER; }
'(' { return '('; }
')' { return ')'; }
'+' { return '+'; }
'-' { return '-'; }
{WS}* {}
\n { return '\n'; }
. { return *yytext; } // do you really want to do this?
%%
Create an end-of-line token (eg ;) for your language and make all lines statements exactly at the point when they encounter this end-of-line token.
Well that is because you are executing the code while parsing it. The good old bison calculator is meant to teach you how to write a grammar, not implement a full compiler/interpreter.
The normal way to build compiler/interpreter is the following:
lexer -> parser -> semantic analyser -> code generator -> interpreter
Granted building a fully fledged compiler may be overkill in your case. What you need to to is store the result somewhere and only output it after yyparse has returned without an error.
In yacc/bison, the code associated with a semantic action is executed as soon as the rule in question is reduced, which may happen immediately after the tokens for the rule have been shifted, before looking at any following context to see if there's and error or not (a so-called "default" reduction, used to make the parse tables smaller).
If you want to avoid printing an answer until an entire line is read (and recognized), you need to include the newline in the rule that has the action that prints the message. In your case, you can move the newline from the program rule to the print_expr rule:
program: command { return 0; } ;
print_expr: PRINT expr_int '\n' { cout<<$2<<endl; }
Of course, this will still give you an error (after printing output) if you give it multiple lines of input.
The input
1 -- Narrowing Variable Initialization
2
3 function main a: integer returns integer;
4 b: integer is a * 2.;
5 begin
6 if a <= 0 then
7 b + 3;
8 else
9 b * 4;
10 endif;
11 end;
is yielding the output
1 -- Narrowing Variable Initialization
2
3 function main a: integer returns integer;
4 b: integer is a * 2.;
5 begin
Narrowing Variable Initialization
6 if a <= 0 then
7 b + 3;
8 else
9 b * 4;
10 endif;
11 end;
Instead of placing that error message under line 4, which is where the error actually occurs. I've looked at it for hours and can't figure it out.
%union
{
char* ident;
Types types;
}
%token <ident> IDENTIFIER
%token <types> INTEGER_LITERAL
%token <types> REAL_LITERAL
%token BEGIN_
%token FUNCTION
%token IS
%token <types> INTEGER
%token <types> REAL
%token RETURNS
%type <types> expression
%type <types> factor
%type <types> literal
%type <types> term
%type <types> statement
%type <types> type
%type <types> variable
%%
program:
/* empty */ |
functions ;
functions:
function_header_recovery body ; |
function_header_recovery body functions ;
function_header_recovery:
function_header ';' |
error ';' ;
function_header:
FUNCTION {locals = new Locals();} IDENTIFIER optional_parameters RETURNS type {globals->insert($3,locals->tList);} ;
optional_parameters:
/* empty */ |
parameters;
parameters:
IDENTIFIER ':' type {locals->insert($1, $3); locals->tList.push_back($3); } |
IDENTIFIER ':' type {locals->insert($1, $3); locals->tList.push_back($3); } "," parameters;
type:
INTEGER | REAL ;
body:
optional_variables BEGIN_ statement END ';' ;
optional_variables:
/* empty */ |
variables ;
variables:
variable IS statement {checkTypes($1, $3, 2);} |
variable IS statement {checkTypes($1, $3, 2);} variables ;
variable:
IDENTIFIER ':' type {locals->insert($1, $3);} {$$ = $3;} ;
statement:
expression ';' |
...
Types checkTypes(Types left, Types right, int flag)
{
if (left == right)
{
return left;
}
if (flag == 1)
{
Listing::appendError("Conditional Expression Type Mismatch", Listing::SEMANTIC);
}
else if (flag == 2)
{
if (left < right)
{
Listing::appendError("Narrowing Variable Initialization", Listing::SEMANTIC);
}
}
return REAL_TYPE;
}
printing being handled by:
void Listing::nextLine()
{
printf("\n");
if (error == "")
{
lineNo++;
printf("%4d%s",lineNo," ");
}
else
{
printf("%s", error.c_str());
error = "";
nextLine();
}
}
void Listing::appendError(const char* errText, int errEnum)
{
error = error + errText;
if (errEnum == 997)
{
lexErrCount++;
}
else if (errEnum == 998)
{
synErrCount++;
}
else if (errEnum == 999)
{
semErrCount++;
}
}
void Listing::display()
{
printf( "\b\b\b\b\b\b " );
if (lexErrCount + synErrCount + semErrCount > 0)
{
printf("\n\n%s%d","Lexical Errors ",lexErrCount);
printf("\n%s%d","Syntax Errors ",synErrCount);
printf("\n%s%d\n","Semantic Errors ",semErrCount);
}
else
{
printf("\nCompiled Successfully\n");
}
}
That's just the way bison works. It produces a one-token lookahead parser, so your production actions don't get triggered until it has read the token following the production. Consequently, begin must be read before the action associated with variables happens. (bison never tries to combine actions, even if they are textually identical. So it really cannot know which variables production applies and which action to execute until it sees the following token.)
There are various ways to associate a line number and/or column position with each token, and to use that information when an error message is to be produced. Interspersing the errors and/or warnings with the input text, in general, requires buffering the input; for syntax errors, you only need to buffer until the next token but that is not a general solution; in some cases, for example, you may want to associate an error with an operator, for example, but the error won't be detected until the operator's trailing argument has been parsed.
A simple technique to correctly intersperse errors/warnings with source is to write all the errors/warnings to a temporary file, putting the file offset at the front of each error. This file can then be sorted, and the input can then be reread, inserting the error messages at appropriate points. The nice thing about this strategy is that it avoids having to maintain line numbers for each error, which noticeably slows down lexical analysis. Of course, it won't work so easily if you allow constructs like C's #include.
Because generating good error messages is hard, and even tracking locations can slow parsing down quite a bit, I've sometimes used the strategy of parsing input twice if an error is detected. The first parse only detects errors and fails early if it can't do anything more reasonable; if an error is detected, the input is reparsed with a more elaborate parser which carefully tracks file locations and possibly even uses heuristics like indentation depth to try to produce better error messages.
As rici notes, bison produces an LALR(1) parser, so it uses one token of lookahead to know what action to take. However, it doesn't ALWAYS use a token of lookahead -- in some cases (where there's only one possibility regardless of lookahead), it uses default reductions which can reduce a rule (and run the associated action) WITHOUT lookahead.
In your case, you can take advantage of that to get the action to run without lookahead if you really need to. The particular rule in question (which triggers the requirement for lookahead) is:
variables:
variable IS statement {checkTypes($1, $3, 2);} |
variable IS statement {checkTypes($1, $3, 2);} variables ;
in this case, after seeing a variable IS statement, it needs to see the next token to decide if there are more variable declarations in order to know which action (the first or the second) to run. But as the two actions are really the same, you could combine them into a single action:
variables: vardecl | vardecl variables ;
vardecl: variable IS statement {checkTypes($1, $3, 2);}
which would end up using a default reduction as it doesn't need the lookahead to decide between two reductions/actions.
Note that the above depends on being able to find the end of a statement without lookahead, which should be the case as long as all statements end unambiguously with a ;
I am teaching myself to use JavaCC in a hobby project, and have a simple grammar to write a parser for. Part of the parser includes the following:
TOKEN : { < DIGIT : (["0"-"9"]) > }
TOKEN : { < INTEGER : (<DIGIT>)+ > }
TOKEN : { < INTEGER_PAIR : (<INTEGER>){2} > }
TOKEN : { < FLOAT : (<NEGATE>)? <INTEGER> | (<NEGATE>)? <INTEGER> "." <INTEGER> | (<NEGATE>)? <INTEGER> "." | (<NEGATE>)? "." <INTEGER> > }
TOKEN : { < FLOAT_PAIR : (<FLOAT>){2} > }
TOKEN : { < NUMBER_PAIR : <FLOAT_PAIR> | <INTEGER_PAIR> > }
TOKEN : { < NEGATE : "-" > }
When compiling with JavaCC I get the output:
Warning: Regular Expression choice : FLOAT_PAIR can never be matched as : NUMBER_PAIR
Warning: Regular Expression choice : INTEGER_PAIR can never be matched as : NUMBER_PAIR
I'm sure this is a simple concept but I don't understand the warning, being a novice in both parser generation and regular expressions.
What does this warning mean (in as-novice-as-you-can-get terms)?
I don't know JavaCC, but I am a compiler engineer.
The FLOAT_PAIR rule is ambiguous. Consider the following text:
0.0
This could be FLOAT 0 followed by FLOAT .0; or it could be FLOAT 0. followed by FLOAT 0; both resulting in FLOAT_PAIR. Or it could be a single FLOAT 0.0.
More importantly, though, you are using lexical analysis with composition in a way that is never likely to work. Consider this number:
12345
This could be parsed as INTEGER 12, INTEGER 345 resulting in an INTEGER_PAIR. Or it could be parsed as INTEGER 123, INTEGER 45, another INTEGER_PAIR. Or it could be INTEGER 12345, another token. The problem exists because you are not requiring white space between the lexical elements of the INTEGER_PAIR (or FLOAT_PAIR).
You should almost never try to handle pairs like this in the lexer. Instead, you should handle plain numbers (INTEGER and FLOAT) as tokens, and handle things like negation and pairing in the parser, where whitespace has been dealt with and stripped.
(For example, how are you going to process "----42"? This is a valid expression in most programming languages, which will correctly calculate multiple negations, but would not be handled by your lexer.)
Also, be aware that single-digit integers in your lexer will not be matched as INTEGER, they will come out as DIGIT. I don't know the correct syntax for JavaCC to fix that for you, though. What you want is to define DIGIT not as a token, but simply something you can use in the definitions of other tokens; alternatively, embed the definition of DIGIT ([0-9]) directly wherever you are using DIGIT in your rules.
I haven't used JavaCC, but it is possible that NUMBER_PAIR is ambiguous.
I think the problem comes down to the fact that the same exact thing can be matched as both FLOAT_PAIR and INTEGER_PAIR since FLOAT can match an INTEGER.
But this is just a guess having never seen the JavaCC syntax :)
It probably means that for every FLOAT_PAIR you'll just get a FLOAT_PAIR token, never a NUMBER_PAIR token. The FLOAT_PAIR rule already matches all the input and JavaCC will not try to find further matching rules. That would be my interpretation, but I don't know JavaCC, so take it with a grain of salt.
Maybe you can specify somehow that NUMBER_PAIR is the main production and that you don't want to get any other tokens as results.
Thanks to Barry Kelly's answer, the solution I've come up with is:
SKIP : { < #TO_SKIP : " " | "\t" > }
TOKEN : { < #DIGIT : (["0"-"9"]) > }
TOKEN : { < #DIGITS : (<DIGIT>)+ > }
TOKEN : { < INTEGER : <DIGITS> > }
TOKEN : { < INTEGER_PAIR : (<INTEGER>) (<TO_SKIP>)+ (<INTEGER>) > }
TOKEN : { < FLOAT : (<NEGATE>)?<DIGITS>"."<DIGITS> | (<NEGATE>)?"."<DIGITS> > }
TOKEN : { < FLOAT_PAIR : (<FLOAT>) (<TO_SKIP>)+ (<FLOAT>) > }
TOKEN : { < #NUMBER : <FLOAT> | <INTEGER> > }
TOKEN : { < NUMBER_PAIR : (<NUMBER>) (<TO_SKIP>)+ (<NUMBER>) >}
TOKEN : { < NEGATE : "-" > }
I had completely forgot to include the space which is used to separate the two tokens, I've also used the '#' symbol which stops the tokens being matched, and is just used in the definition of other tokens. The above is compiled by JavaCC without warning or error.
However, as noted by Barry, there are reasons against doing this.