I am trying to write a compiler for the COOL language and am currently working on lexical analysis. As I understand it, Flex always prefers the longest match.
Suppose the input is:
class A inherits B
Now if my token for class is returned by following pattern:
^"class" return CLASS;
For my inherits token:
^"class"[ ]+[a-zA-Z]+[0-9]?[ ]+"inherits"[ ]+ return INHERITS;
Since flex matches the longest pattern, it will always return INHERITS and never CLASS. Is there a workaround for this problem?
I can here return token for class alone. But how do I return token for inherits since it MUST be preceded by a class token and its name followed by another string token?
But if I try to impose constraints on inherits, then flex will match the largest pattern not class alone.
Then should I return the enums/number for class identifier individually? And if I do that, how do I identify 'inherits' identifier?
EDIT:
class A inherits B {
main(): SELF_TYPE{...}
}
How does flex match against main? My reflexer differentiates between TypeID, which is A, and main, which it declares an ObjectID. The only way it can do that is by looking ahead at the parenthesis: if it finds (, it declares an ObjectID. But if I do that, I run into the same problem as above: flex will never match ( on its own but always main(.
You are trying to do too much in Flex, and perhaps you misunderstand the role and boundaries of the lexical phase. You shouldn't be attempting to parse the whole sentence with a Flex regex alone. Flex's job is to consume a stream of text, and convert it to a stream of integer tokens. The sentence you've provided:
class A inherits B
represents multiple tokens from a language that requires parsing. Flex is not a parser, it is a lexical scanner/tokenizer. (Technically it is a parser of bytes or characters, but you want to "parse" atomic units that represent the words of your language, not characters).
So there are 4 distinct tokens (atomic units), also known as TERMINALS in the above sentence: [CLASS, A, INHERITS, B].
You need an IDENTIFIER rule for Flex, such that anything that doesn't match a token, falls through to an IDENTIFIER, so the tokens returned by Flex to the parser are:
CLASS IDENTIFIER INHERITS IDENTIFIER
The job for Flex is to parse each word / token and convert the text to distinct integer values to be consumed by Bison or any other parser.
You typically have a Yacc/Bison BNF grammar to handle:
class_decl:
CLASS IDENTIFIER
| CLASS IDENTIFIER INHERITS IDENTIFIER
;
So your Lex rules would look like this. You need to return the IDENTIFIER token to the parser while attaching the actual symbol (A, B), which you get from the yytext variable:
LETTER [a-zA-Z_]
DIGIT [0-9]
LETTERDIGIT [a-zA-Z0-9_]
%%
"class" return(CLASS);
"inherits" return(INHERITS);
{LETTER}{LETTERDIGIT}* {
yylval.sym = new Symbol(yytext);
yylval.sym->line = line;
fprintf(stderr, "TOKEN IDENTIFIER(%s)\n", yytext);
return(IDENTIFIER);
}
If you are really trying to do all of this within Flex, then it is possible, but you will end up with a mess, like if you try to parse HTML with regex... :)
I would be grateful if someone could explain how the following regex should be interpreted; it is from the W3C reference for Namespaces in XML 1.0, and defines an NCName ([4]) as:
Name - (Char* ':' Char*) /* An XML Name, minus the ":" */
I can understand subtraction when applied to lists, such as:
[a-z-[aeiuo]] representing the list of all consonants (see http://www.regular-expressions.info/charclasssubtract.html), but not when applied to a group (apologies if this is the wrong term) as shown above.
The comment indicates how I should interpret the regex, but I'm struggling; why not just:
Name - ( ':' )
if the intention is for NCName to be Name minus ':', then why are the zero or more characters required on either side (I'm not asking a separate question, just indicating my area of confusion)?
Please accept my thanks in advance.
The documents published by W3C use a variant of the EBNF Notation to describe the languages standardized by them.
It is described in section "6 Notation" of the XML Recommendation.
The example you posted:
NCName ::= Name - (Char* ':' Char*) /* An XML Name, minus the ":" */
How to read it:
NCName is the object described by the rule;
::= separates the name of the described object (on the left) from the expression that describes it (on the right);
Name is an object already described by another rule;
- is the except symbol; A - B in EBNF means "matches A but doesn't match B";
(...) - the parentheses create a group; they make the expression inside them behave as a single item;
Char is another object already described by another rule in the documentation; it basically means a Unicode character;
* - repetition, matches the previous item zero or more times;
':' - string in single or double quotes is a string literal; it represents itself; here, the colon character;
Put together, it means a NCName is a Name that doesn't contain :.
The comment seems incorrect (or maybe it is just badly worded).
I'm trying to write a lexer to parse a file that looks like this:
one.html /two/
one/two/ /three
three/four http://five.com
Each line has two strings separated by a space. I need to create two regex patterns: one to match the first string, and another to match the second string.
This is my attempt at the regex for the lexer (a file named lexer.l to be run by flex):
%%
(\S+)(?:\s+\S+) { printf("FIRST %s\n", yytext); }
(?:\S+\s+)(\S+) { printf("SECOND %s\n", yytext); }
. { printf("Mystery character %s\n", yytext); }
%%
I have tested both (\S+)(?:\s+\S+) and (?:\S+\s+)(\S+) in the Regex101 tester and they both seem to be working properly: https://regex101.com/r/FQTO15/1
However, when I try to build the lexer by running flex lexer.l, I get a warning:
lexer.l:3: warning, rule cannot be matched
This is referring to the second rule I have. If I attempt to reverse the order of the rules, I get the error on the second one yet again. If I only leave in one of the rules, it works perfectly fine.
I believe this issue has to do with the fact that both regexes are similar and of the same length, so flex sees it as ambiguous, even though the two regexes capture different things (but they match the same things?).
Is there anything I can do with the regex so that it will capture/match what I want without clashing with each other?
EDIT: More Test Examples
one.html /two/
one/two.html /three/four/
one /two
one/two/ /three
one_two/ /three
one%20two/ /three
one/two/ /three/four
one/two /three/four/five/
one/two.html http://three.four.com/
one/two/index.html http://three.example.com/four/
one http://two.example.com/three
one/two.pdf https://example.com
one/two?query=string /three/four/
go.example.com https://example.com
EDIT
It turns out that the regex engine used by flex is rather limited. It cannot do grouping and it also doesn't seem to use \s for spaces.
So this wouldn't work:
^.*\s.*$
But this does:
^.*" ".*$
Thanks to #fossil for all their help.
Although there are ways to solve your problem as stated, I think you would be better off understanding the intended use of (f)lex, and to find a solution consistent with its processing model.
(F)lex is intended to split an input into individual tokens. Each token has a type, and it is expected that it is possible to figure out the type of a token simply by looking at it (and not at its context). The classic model of a token type are the objects in a computer program, where we have, for example, identifiers, numbers, certain keywords, and various operators. Given an appropriate set of rules, a (f)lex scanner will take an input like
a = b*7 + 2;
and produce a stream of tokens:
identifier = identifier * number + number ;
Each of these tokens has an associated "semantic value" (which not all of them actually require), so that the two identifier tokens and the two number tokens are not just anonymous blobs.
Note that a and b in the above line have different roles. a is being assigned to, while b is being referred to. But that's not relevant to their form, and it is not evident from their form. They are just tokens. Figuring out what they mean and their relationship with each other is the role of a parser, which is a separate part of the parsing model. The intention of the two-phase scan/parse paradigm is to simplify both tasks by abstracting away complications: the scanner knows nothing about context or meaning, while the parser can deduce the logical structure of the input without concerning itself with the messy details of representation and irrelevant whitespace.
In many ways, your problem is a bit outside of this paradigm, in part because the two token types you have cannot be distinguished on the basis of their appearance alone. If they have no useful internal structure, though, then you could just accept that your input consists of
"paths", which do not contain whitespace, and
newline characters.
You could then use a combination of a lexer and a parser to break the input into lines:
File splitter.l
%{
#include <string.h>
#include "splitter.tab.h"
%}
%option noinput nounput noyywrap nodefault
%%
\n { return '\n'; }
[^[:space:]]+ { yylval = strdup(yytext); return PATH; }
[[:space:]] /* Ignore whitespace other than newlines */
File splitter.y
%code {
#include <stdio.h>
#include <stdlib.h>
int yylex();
void yyerror(const char* msg);
}
%code requires {
#define YYSTYPE char*
}
%token PATH
%%
lines: %empty
| lines line '\n'
line : %empty
| PATH PATH { printf("Map '%s' to '%s'\n", $1, $2);
free($1); free($2);
}
%%
void yyerror(const char* msg) {
fprintf(stderr, "%s\n", msg);
}
int main(int argc, char** argv) {
return yyparse();
}
Quite a lot of the above is boiler-plate; it's worth concentrating just on the grammar and the token patterns.
The grammar is very simple:
lines: %empty
| lines line '\n'
line : %empty
| PATH PATH { printf("Map '%s' to '%s'\n", $1, $2);
free($1); free($2);
}
The interesting line is the last one, which says that a line consists of two PATHs. That handles each line by printing it out, although you'd probably want to do something different. It is this line which understands that the first word on a line and the second word on the same line have different functions. Note that it doesn't need the lexer to label the two words as "FIRST" and "SECOND", since it can see that all by itself :)
The two calls to free release the memory allocated by strdup in the lexer, thus avoiding a memory leak. In a real application, you'd need to make sure you don't free the strings until you don't need them any more.
The lexer patterns are also very simple:
\n { return '\n'; }
[^[:space:]]+ { yylval = strdup(yytext); return PATH; }
[[:space:]] /* Ignore whitespace other than newlines */
The first one returns a special single-character token, a newline character, to serve as the end-of-line token. The second one matches any string of non-whitespace characters. ((F)lex doesn't know about GNU regex extensions, so it doesn't have \s and friends. It does, however, have the much more readable Posix character classes, which are listed in the flex manual, among other places.) The third pattern skips any whitespace. Since \n was already handled by the first pattern, it cannot be matched here (which is why this pattern is a single whitespace character and not a repetition).
In the second pattern, we assign a value to yylval, which is the semantic value of the token. (We don't do this elsewhere because the newline token doesn't need a semantic value.) yylval always has type YYSTYPE, which we have arranged to be char* by a #define. Here, we just set it from yytext, which is the string of characters (f)lex has just matched. It is important to make a copy of this string because yytext is part of the lexer's internal structure, and its value will change without warning. Having made a copy of the string, we are then obliged to ensure that the memory is eventually released.
To try this program out:
bison -o splitter.tab.c -d splitter.y
flex -o splitter.lex.c splitter.l
gcc -Wall -O2 -o splitter splitter.tab.c splitter.lex.c
I would like to create a grammar rule for a printable character (any character for which the C isprint() function returns true).
For this purpose I created the following regex rule inside my lex file:
[\x20-\x7E] { yylval.ch = strdup(yytext); return CHARACTER; }
The regular expression contains all the printable characters based on their ASCII hexadecimal value.
On my first attempt this rule was located at the bottom, but any printable character matched by an earlier rule obviously wasn't included. For example, if my input was the character '+' and I had a previous rule:
"+" { return PLUS_OPERATOR; }
The parser accepted it as a PLUS_OPERATOR and not as CHARACTER.
Then I tried to place the character rule at the top of my scanner, and for the same reason as before, none of the following rules for characters in the printable range could be matched.
My question is: what can I do to create a rule that will match all printable characters while still keeping rules for specific characters?
The only thing that I can think of is to put it at the bottom and use a grammar rule combining all the one-character token rules and the character rule (e.g. CHAR : PLUS_OPERATOR | MINUS_OPERATOR | EQUAL_OPERATOR | CHARACTER)
I have a lot more than 3 one-character rules in my lex file, so obviously I'm looking for a more elegant solution.
The only solution is the one you propose: create a non-terminal which is the union of all the relevant terminals.
Personally, I find grammars much more readable if single-character tokens are written as themselves, so I would write:
printable: '+' | '-' | '=' | CHAR
in the bison file, and in the scanner:
[-+=] { yylval.ch = yytext[0]; return yylval.ch; }
[[:print:]] { yylval.ch = yytext[0]; return CHAR; }
(which in turn requires the semantic type to be a union with both char and char* fields; the advantage is that you don't need to worry about freeing the strings created for operator characters.)
That is about as elegant as it gets, I'm afraid.
I'm very interested in Syntax Definitions for Sublime Text 2.
I've studied the basics but I don't know how to write a regex (and rules) for something like
variable = sentence, i.e. myvar = func(foo, bar) + baz
I can't write anything better than ^\s*([^=\n]+)=([^=\n]+\n) (that doesn't work)
How do I write this regex properly?
Also, I have some difficulty defining a regex for a block
IF i FROM .. TO ..
...
ELSE
...
END IF
How do I write it?
In this case you have to write a parser. A regex won't work because the patterns may vary. You've already noticed it when you stated 'variable = sentence'.
For this, you can use Spoofax or JavaCUP for grammar definitions. I'll give you a snippet in JavaCUP:
Scanner issues: suppose 'variable' follows the pattern: (_|[a-zA-Z])(_|[a-zA-Z])*
and 'number' is: ([0-9])+
Note that number could be any decimal or int, but here I state it as that pattern, supposing my language only deals with integer (or whatever that pattern means :) ).
Now we can declare our grammar following the JavaCUP syntax. Which is more or less like:
expression ::= variable "=" sentence;
sentence ::= sentence "+" sentence;
sentence ::= sentence "-" sentence;
sentence ::= sentence "*" sentence;
sentence ::= sentence "/" sentence;
sentence ::= number;
...and that goes further.
If you've never taken a compilers class, this may seem very difficult to follow. There are also many grammar restrictions needed to avoid infinite loops in the parser, depending on which kind you're using (LR or LL).
Anyway, the real answer to your question is: you can't do what you want with regex alone; you'll need more concepts.
My situation: I'm new to Spirit, I have to use VC6 and am thus using Spirit 1.6.4.
I have a line that looks like this:
//The Description;DESCRIPTION;;
I want to put the text DESCRIPTION in a string if the line starts with //The Description;.
I have something that works but looks not that elegant to me:
vector<char> vDescription; // std::string doesn't work due to missing ::clear() in VC6's STL implementation
if(parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+~ch_p(';'))[assign(vDescription)]
),
// End grammar
space_p).hit)
{
const string desc(vDescription.begin(), vDescription.end());
}
I would much rather assign all printable characters up to the next ';', but the following won't work because parse(...).hit == false
parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+print_p)[assign(vDescription)]
>> ';'
),
// End grammar
space_p).hit)
How do I make it hit?
You might try using confix_p:
confix_p(as_lower_d["//the description;"],
(+print_p)[assign(vDescription)],
ch_p(';')
)
It should be equivalent to Fred's response.
The reason your code fails is because print_p is greedy. The +print_p parser will consume characters until it encounters the end of the input or a non-printable character. Semicolon is printable, so print_p claims it. Your input gets exhausted, the variable is assigned, and the match fails — there's nothing left for the last semicolon of your parser to match.
Fred's answer constructs a new parser, (print_p - ';'), which matches everything print_p does, except for semicolons. "Match everything except X, and then match X" is a common pattern, so confix_p is provided as a shortcut for constructing that kind of parser. The documentation suggests using it for parsing C- or Pascal-style comments, but that's not required.
For your code to work, Spirit would need to recognize that the greedy print_p matched too much and then backtrack to allow matching less. But although Spirit will backtrack, it won't backtrack to the "middle" of what a sub-parser would otherwise greedily match. It will backtrack to the next "choice point," but your grammar doesn't have any. See Exhaustive backtracking and greedy RD in the Spirit documentation.
You're not getting a hit because ';' is matched by print_p. Try this:
parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+(print_p-';'))[assign(vDescription)]
>> ';'
),
// End grammar
space_p).hit)