Flex RegEx to find string not starting with a pattern - regex

I'm writing a lexer to scan a modified version of an INI file.
I need to recognize the declaration of variables, comments and strings (between double quotes) to be assigned to a variable. For example, this is correct:
# this is a comment
var1 = "string value"
I've successfully managed to recognize these tokens by forcing the # at the beginning of the comment regular expression and the " at the end of the string regular expression, but I don't want to do this, because later on, using Bison, the tokens I get are exactly # this is a comment and "string value". Instead I want this is a comment (without the #) and string value (without the ").
These are the regular expressions that I currently use:
[a-zA-Z][a-zA-Z0-9]* { return TOKEN_VAR_NAME; }
["][^\n\r]*["] { return TOKEN_STRING; }
[#][^\n\r]* { return TOKEN_COMMENT; }
Obviously there can be any number of spaces, as well as tabs, inside the string, inside the comment, and between the variable name and the =.
How could I achieve the result I want?
Maybe it will be easier if I show you a complete example of a correct input file and also the grammar rules I use with Flex and Bison.
Correct input file example:
[section1]
var1 = "string value"
var2 = "var1 = text"
# this is a comment
# var5 = "some text" this is also a valid comment
These are the regular expressions for the lexer:
"[" { return TOKEN::SECTION_START; }
"]" { return TOKEN::SECTION_END; }
"=" { return TOKEN::ASSIGNMENT; }
[#][^\n\r]* { return TOKEN::COMMENT; }
[a-zA-Z][a-zA-Z0-9]* { *m_yylval = yytext; return TOKEN::ID; }
["][^\n\r]*["] { *m_yylval = yytext; return TOKEN::STRING; }
And these are the syntax rules:
input : input line
| line
;
line : section
| value
| comment
;
section : SECTION_START ID SECTION_END { createNewSection($2); }
;
value : ID ASSIGNMENT STRING { addStringValue($1, $3); }
;
comment : COMMENT { addComment($1); }
;

To do that you have to treat " and # as separate tokens (so they get scanned as individual tokens, distinct from the ones you are scanning now) and use a %s or %x start condition to change which patterns the scanner accepts once those tokens have been read.
This adds another drawback: you will receive # as an individual token before the comment, and " before and after the string contents, and you'll have to cope with that in your grammar. This complicates both the grammar and the scanner, so I have to discourage you from following this approach.
There is a better solution: write a routine to unescape things, let the scanner stay simple by matching the whole string into yytext, and then simply do
m_yylval = unescapeString(yytext); /* drop the " chars */
return STRING;
or
m_yylval = uncomment(yytext); /* drop the # at the beginning */
return COMMENT; /* return EOL if you are trying the example at the end */
in the yylex() function.
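For illustration, a minimal sketch of what those two helpers could look like, assuming the semantic value is a std::string (as the question's *m_yylval = yytext; suggests); this would go in the flex prologue:
#include <string>

// drop the surrounding " chars; assumes yytext matched ["][^\n\r]*["]
std::string unescapeString(const char *text) {
    std::string s(text);
    return s.substr(1, s.size() - 2);
}

// drop the # at the beginning; assumes yytext matched [#][^\n\r]*
std::string uncomment(const char *text) {
    return std::string(text + 1);
}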
Note
As comments are normally ignored, the best thing is to ignore them with a rule like:
"#".* ; /* ignored */
in your flex file. This makes the generated scanner ignore the token just read instead of returning it.
Note 2
You probably haven't taken into account that your parser will allow you to introduce lines of the form:
var = "data"
in front of any
[section]
line, so you'll run into trouble trying to addStringValue(...); when no section has been created yet. One possible solution is to modify your grammar to separate the file into sections and force each section to begin with a section line, like:
compilation: file comments ;
file: file section
    | ; /* empty */
section: section_header section_body ;
section_header: comments '[' ident ']' EOL ;
section_body: section_body comments assignment
    | ; /* empty */
comments: comments COMMENT
    | ; /* empty */
This is complicated by the fact that you want to process the comments. If you were to ignore them (using ; in the flex scanner) the grammar would be:
file: empty_lines file section
    | ; /* empty */
empty_lines: empty_lines EOL
    | ; /* empty */
section: header body ;
header: '[' IDENT ']' EOL ;
body: body assignment
    | ; /* empty */
assignment: IDENT '=' strings EOL
    | EOL ; /* empty lines or lines with comments */
strings: strings unit
    | unit ;
unit: STRING
    | IDENT
    | NUMBER ;
This way the first thing allowed in your file, apart from comments (which are ignored) and blank space, is a section header. (EOLs are not considered blank space: we cannot ignore them, as they terminate lines.)
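For completeness, here is a sketch of flex rules that could feed this last grammar (token names follow the grammar above; the semantic-value handling is left out):
[ \t]+               ;                  /* blank space is skipped */
"#".*                ;                  /* comments are ignored, as suggested above */
\n                   { return EOL; }    /* EOLs terminate lines, so they are real tokens */
"["                  { return '['; }
"]"                  { return ']'; }
"="                  { return '='; }
[0-9]+               { return NUMBER; }
[a-zA-Z][a-zA-Z0-9]* { return IDENT; }
["][^"\n\r]*["]      { return STRING; } /* strip the quotes as shown earlier before use */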

Use alternate syntax highlighting in middle of TextMate2 comment

By the very nature of a comment, this might not make sense.
On the other hand, what I'm trying to achieve is not too different from an escape character.
As a simple example, I want
# comment :break: comment
to show up more like
#comment
"break"
# comment
would, but without the second #, everything is on the same line, and instead of quotes I have some other escape character. Although, like quotes (and unlike escape characters that I'm familiar with [e.g., \]), I intend to explicitly indicate the beginning and the end of the interruption to the comment.
Thanks to @Graham P Heath, I was able to achieve alternate forms of comments in this question. What I'm after is an enhancement to what was achieved there. In my scenario, # is a comment in the language I'm using (R), and #' functions both as an R comment and as the start of code in another language. Now, I can get everything after the #' to take on syntax highlighting that is different from the typical R comment, but I'm trying to get a very modest amount of syntax highlighting in this sub-language (#' actually indicates the start of markdown code, and I want the "raw" syntax highlighting for text surrounded by a pair of ` ).
The piece of the language grammar that I'm trying to interrupt is as follows:
{ begin = '(^[ \t]+)?(?=#'' )';
  end = '(?!\G)';
  beginCaptures = { 1 = { name = 'punctuation.whitespace.comment.leading.r'; }; };
  patterns = (
    { name = 'comment.line.number-sign-tick.r';
      begin = "#' ";
      end = '\n';
      beginCaptures = { 0 = { name = 'punctuation.definition.comment.r'; }; };
    },
  );
},
I'm pretty sure I've figured it out. What I didn't understand previously was how the scoping worked. I still don't understand it fully, but I now know enough to create nested definitions (regex) for the begin and end of each type of syntax.
The scoping makes things so much easier! Previously I wanted to do regex like (?<=\A#'\s.*)(\$) to find a dollar sign within the #'-style comment ... but obviously that won't work because of the repetition with * (+ wouldn't work for the same reason). Via scoping, it's already implied that we have to be inside the \A#'\s match before \$ will be matched.
Here is the relevant portion of my Language Grammar:
{ begin = '(^[ \t]+)?(?=#\'' )';
  end = '(?!\G)';
  beginCaptures = { 1 = { name = 'punctuation.whitespace.comment.leading.r'; }; };
  patterns = (
    { name = 'comment.line.number-sign-tick.r';
      begin = "#' ";
      end = '\n';
      beginCaptures = { 0 = { name = 'punctuation.definition.comment.r'; }; };
      patterns = (
        // Markdown within Comment
        { name = 'comment.line.number-sign-tick-raw.r';
          begin = '(`)(?!\s)';  // backtick not followed by whitespace
          end = '(?<!\s)(`)';   // backtick not preceded by whitespace
          beginCaptures = { 0 = { name = 'punctuation.definition.comment.r'; }; };
        },
        // Equation within comment
        { name = 'comment.line.number-sign-tick-eqn.r';
          begin = '((?<!\G)([\$]{1,2})(?!\s))';
          end = '(?<!\s)([\$]{1,2})';
          beginCaptures = { 0 = { name = 'punctuation.definition.comment.r'; }; };
          // Markdown within Equation
          patterns = (
            { name = 'comment.line.number-sign-tick-raw.r';
              begin = '(`)(?!\s)';  // backtick not followed by whitespace
              end = '(?<!\s)(`)';   // backtick not preceded by whitespace
              beginCaptures = { 0 = { name = 'punctuation.definition.comment.r'; }; };
            },
          );
        },
      );
    },
  );
},
Here is some R code:
# below is a `knitr` (note no effect of backticks) code chunk
#+ codeChunk, include=FALSE
# normal R comment, follow by code
data <- matrix(rnorm(6,3, sd=7), nrow=2)
#' This would be recognized as markdown by `knitr::spin()`, with the preceding portion as "raw" text
`note that this doesnt go to the 'raw' format ... it is normal code!`
#+ anotherChunk
# also note how the dollar signs behave normally
data <- as.list(data)
data$blah <- "blah"
`data`[[1]] # backticks behaving
#' I can introduce a Latex-style equation, filling in values from R using `knitr` code chunks: $\frac{top}{bottom}=\frac{`r topValue`}{`r botValue`}$ then continue on with markdown.
And here is what that looks like in TextMate2 after making these changes:
Pretty good, except the backticked pieces take on the italics when they're inside an equation. I can live with that. I can even convince myself that I wanted it that way ;) (by the way, I specified fontName='regular' for the courier new, so I don't know why that's getting overridden)

Why in my flex lexer line count is incremented in one case and is not incremented in the other?

My assignment (it is not graded and I get nothing from solving it) is to write a lexer/scanner/tokenizer (whatever you want to call it). flex is used for this class. The lexer is written for the Classroom Object Oriented Language, or COOL.
In this language multi-line comments start and end like this:
(* line 1
line 2
line 3 *)
These comments can be nested. In other words the following is valid:
(* comment1 start (* comment 2 start (* comment 3 *) comment 2 end *) comment 1 end *)
Strings in this language are regular quoted strings, just like in C. Here is an example:
"This is a string"
"This is another string"
There is also an extra rule saying that there cannot be an EOF in the comment or in the string. For example the following is invalid:
(* comment <EOF>
"My string <EOF>
I wrote a lexer to handle it. It keeps track of the line count by looking for \n.
Here is the problem that I'm having:
When the lexer encounters an EOF in a comment, it increments the line count by 1; however, when it encounters an EOF in a string, it doesn't.
For example, when the lexer encounters the following code
Line 1: (* this is a comment <EOF>
the following error is displayed:
#2 ERROR "EOF in comment"
However when it encounters this code:
Line 1: "This is a string <EOF>
the following error is displayed:
#1 ERROR "String contains EOF character"
I can't understand why this happens (the line number is incremented in one case and not in the other). Below are some of the rules I used to match comments and strings. If you need more, just ask and I will post them.
<BLOCK_COMMENT>{
    [^\n*)(]+   ;               /* Eat the comment in chunks */
    ")"         ;               /* Eat a lonely right paren */
    "("         ;               /* Eat a lonely left paren */
    "*"         ;               /* Eat a lonely star */
    \n          curr_lineno++;  /* increment the line count */
}

/* Can't have EOF in the middle of a block comment */
<BLOCK_COMMENT><<EOF>> {
    cool_yylval.error_msg = "EOF in comment";
    /*
     * Need to return to INITIAL, otherwise the program will be stuck
     * in an infinite loop. This was determined experimentally.
     */
    BEGIN(INITIAL);
    return ERROR;
}

/* Match <backslash>\n or \n */
<STRING>\\\n|\n {
    curr_lineno++;
}

<STRING><<EOF>> {
    /* String may not have an EOF character */
    cool_yylval.error_msg = "String contains EOF character";
    /*
     * Need to return to INITIAL, otherwise the program will be stuck
     * in an infinite loop. This was determined experimentally.
     */
    BEGIN(INITIAL);
    return ERROR;
}
So the question is: why is the line number incremented in the case of a comment, while in the case of a string it stays the same?
Any help is appreciated.
Because the pattern for a comment doesn't require a newline to exist in order to increment the line number, and the pattern for a string does.
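If you want the EOF error reported on the line where the construct began, one option (a sketch, not part of the original assignment code) is to record the line number when entering the start condition, and use that in the error action instead of curr_lineno:
%{
/* hypothetical: remember the line on which a string or comment began */
static int construct_start_line;
%}
%x STRING BLOCK_COMMENT
%%
\"      { construct_start_line = curr_lineno; BEGIN(STRING); }
"(*"    { construct_start_line = curr_lineno; BEGIN(BLOCK_COMMENT); }
The two <<EOF>> actions would then report construct_start_line, making the comment and string cases behave identically.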

(F)Lex : get text not matched by rules / get default output

I've read a lot about (F)Lex so far, but I couldn't find an answer.
Actually I have two questions; an answer to either one would be enough.
I have strings like:
TOTO 123 CD123 RGF 32/FDS HGGH
For each token I find, I put it in a vector. For example, for this string, I get a vector like this:
vector = TOTO, whitespace, CD, 123, whitespace, RGF, whitespace, 32, FDS, whitespace, HGGH
The "/" does not match any rules, but still, i would like to put it in my vector when I reach it and get:
vector = TOTO, whitespace, CD, 123, whitespace, RGF, whitespace, 32, /, FDS, whitespace, HGGH
So my questions are:
1) Is it possible to modify the default action when the input does not match any rule (instead of printing it on stdout)?
2) If that is not possible, how can I catch this? Here "/" is just an example; it could be anything (%, C, 3, Blabblabla, etc.) that does not match my rules, and I can't simply put
.* { else(); }
because Flex applies the regex that matches the longest string. I would like my rules to be "sorted", with ".*" tried last, as if changing Flex's "preferences".
Any idea?
The usual way is to have a rule something like
. { do_something_with_extra_char(*yytext); }
at the END of your rules. This will match any single character (other than newline; you need a rule that matches newline somewhere too) that doesn't match any other rule. If you have multiple unmatched characters, this rule will trigger multiple times, but generally that is fine.
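As a minimal illustration (the word/number rules and the vector_add helper are hypothetical, standing in for the asker's setup):
%%
[A-Za-z]+  { vector_add(yytext); }   /* letter runs: TOTO, CD, RGF, ... */
[0-9]+     { vector_add(yytext); }   /* digit runs: 123, 32, ... */
[ \t]+     { vector_add(" "); }      /* whitespace token */
\n         { /* handle end of line */ }
.          {                         /* any single char no other rule matched, e.g. "/" */
             char s[2] = { *yytext, '\0' };
             vector_add(s);
           }
%%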
EDIT: I think Chris Dodd's answer is better. Here are two alternative solutions.
One solution would be to use states. When you read a single unrecognized character, enter a different state and build up the unrecognized token.
%{
char str[1024];
int strUsed;
%}
%x UNRECOGNIZED
%%
{SOME_RULE}                { /* do processing */ }
.                          { BEGIN(UNRECOGNIZED); str[0] = yytext[0]; strUsed = 1; }
<UNRECOGNIZED>{bad_input}  { strcpy(str + strUsed, yytext); strUsed += yyleng; }
<UNRECOGNIZED>{good_input} { str[strUsed] = 0; vector_add(str); BEGIN(INITIAL); }
This solution works well if it's easy to write a regular expression to match "bad" input. Another solution is to slowly build up bad characters until the next valid match:
%{
char str[1024];
int strUsed = 0;

void goodMatch() {
    if (strUsed) {
        str[strUsed] = 0;
        vector_add(str);
        strUsed = 0;
    }
}
%}
%%
{SOME_RULE} { goodMatch(); /* do processing */ }
.           { str[strUsed++] = yytext[0]; }
Note that this requires you to modify all existing rules to add a call to the goodMatch function.
Note for both solutions: if you use a statically sized buffer, you'll have to ensure you don't overflow it on the strcpy. If you end up using a dynamically sized string, you'll have to be sure to correctly clean up memory.
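For the static buffer, the guard can be as simple as this sketch (the error handling is a placeholder; you could also flush the buffer or grow it dynamically):
/* before appending yytext to str: */
if (strUsed + yyleng + 1 > (int)sizeof str) {
    fprintf(stderr, "unrecognized run too long\n");  /* placeholder */
} else {
    strcpy(str + strUsed, yytext);
    strUsed += yyleng;
}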

flex and bison: parse string without quotes

I'm working on an mgf file parser (syntax: http://www.matrixscience.com/help/data_file_help.html) using flex + bison (C++).
I've written the lexer (lex) and the parser (yacc), but I have a problem that I can't solve: parsing strings.
Important: there is no ' or " around the string.
Here is an example of input:
CHARGE=1+, 2+ and 3+
#some comments
BEGIN IONS
TITLE= Cmpd 1, +MSn(417.2108), 10.0 min //line 20
PEPMASS=417.21083 35173
CHARGE=3+
123.79550 20
285.16455 56
302.14335 146 1+
[other data ...]
END IONS
BEGIN IONS
[another one ...]
Here is the (minimal) lexer. MGF_TOKEN_DEBUG is just a macro to print a line:
#define MGF_TOKEN_DEBUG(val) std::cout<<"token: "<<val<<std::endl
\n {
    MGF_TOKEN_DEBUG("T_EOL");
    return token::T_EOL;
}
^[#;!/][^\n]* {
    MGF_TOKEN_DEBUG("T_COMMENT");
    return token::T_COMMENT;
}
[[:space:]] {}

    /** values **/
[0-9]+ {
    MGF_TOKEN_DEBUG("V_INTEGER"<<" (="<<yytext<<")");
    return token::V_INTEGER;
}
[0-9]+"."[0-9]* {
    MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
    return token::V_DOUBLE;
}
[0-9]+("."[0-9]+)?[eE][+-][0-9]+ {
    MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
    return token::V_DOUBLE;
}
"+" {
    MGF_TOKEN_DEBUG("T_PLUS");
    return token::T_PLUS;
}
"=" {
    MGF_TOKEN_DEBUG("T_EQUALS");
    return token::T_EQUALS;
}
"," {
    MGF_TOKEN_DEBUG("T_COMA");
    return token::T_COMA;
}
"and" {
    MGF_TOKEN_DEBUG("T_AND");
    return token::T_AND;
}

    /*** keywords */
^"CHARGE" {
    MGF_TOKEN_DEBUG("K_CHARGE");
    return token::K_CHARGE;
}
^"TITLE" {
    MGF_TOKEN_DEBUG("K_TITLE");
    return token::K_TITLE;
}
[other keywords ...]

    /**** string: problem here **/
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space]])* {
    MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");
    return token::V_STRING;
}
And here is the (minimized) parser:
start : headerparams blocks T_END ;
headerparams : /* empty */
    | headerparams headerparam ;
headerparam : K_CHARGE T_EQUALS charge_list T_EOL
    | [others ...] ;
blocks : /* empty */
    | blocks block ;
block : T_BEGIN_IONS T_EOL blockparams ions T_END_IONS T_EOL
    | T_BEGIN_IONS T_EOL blockparams T_END_IONS T_EOL ;
blockparam : K_CHARGE T_EQUALS charge T_EOL
    | K_TITLE T_EQUALS V_STRING T_EOL
    | [others...] ;
ion : number number T_EOL
    | number number charge T_EOL ;
ions : ions ion
    | ion ;
number : V_INTEGER | V_DOUBLE ;
charge : V_INTEGER T_PLUS | V_INTEGER T_MINUS ;
charge_list : charge
    | charge_list T_COMA charge
    | charge_list T_AND charge ;
My problem is that I get the following tokens:
[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: V_STRING (= Cmpd)
token: V_INTEGER (= 1)
Error line 20: syntax error, unexpected integer, expecting end of line
I would like to have:
[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: V_STRING (= Cmpd 1, +MSn (417.2108), 10.0 min)
token: T_EOL
If someone can help me ...
Edit #1
I've "solve" the problem using the concatenation of tokens:
lex:
[A-Za-z][^\n[:space:]+-=,]* {
    MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");
    return token::V_STRING;
}
yacc:
string_st : V_STRING
    | string_st V_STRING
    | string_st number
    | string_st T_COMA
    | string_st T_PLUS
    | string_st T_MINUS
    ;
blockparam : K_CHARGE T_EQUALS charge T_EOL | K_TITLE T_EQUALS string_st T_EOL | [others...];
If your string always starts with some text (TITLE) and ends with a newline character \n, I would suggest you use start conditions:
%x IN_TITLE
"TITLE" { /* return V_STRING of TITILE in c++ code */ BEGIN(IN_TITLE); }
<IN_TITLE>= { /* return T_EQUALS in c++ code */; }
<IN_TITLE>"\n" { BEGIN(INITIAL); }
<IN_TITLE>.* { MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");return token::V_STRING; }
%x IN_TITLE defines the IN_TITLE state, and the pattern "TITLE" makes the scanner enter it. Once in that state, \n sends the scanner back to the initial state (INITIAL is predefined), and every other character on the line is simply consumed into V_STRING without any particular action.
Your basic problem is a simple typo:
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space]])*
should be:
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space:]])*
(note the added colon closing [[:space:]])
You don't actually need the | operator. The following is perfectly legal (but probably not what you want either; see below):
[A-Za-z][[:space:]:;,()A-Za-z0-9_.-]*
Once you fix that, you'll find that you have another problem: your keywords (TITLE, for example) will be lexed as STRING, because the STRING pattern yields a longer match. (In fact, since [:space:] includes \n, the STRING pattern will probably extend to the end of the input. You probably wanted [:blank:].)
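For instance, a version restricted to horizontal whitespace stays on a single line (a sketch; it does not by itself resolve the keyword-versus-STRING ambiguity):
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:blank:]])* { return token::V_STRING; }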
I took a quick glance at the description of the format you're trying to parse, but it's not a very precise description. But it appears that parameter lines have the format:
^[[:alpha:]]+=.*$
Perhaps the :alpha: should be :alnum: or even something more permissive; as I said, the description wasn't very precise. What was clear is that:
The keyword is case-insensitive, so both TITLE and title will work identically, and
The = sign is obligatory and may not have a space on either side of it. (So your TITLE= line is not correct, but maybe it doesn't matter).
In order to not interfere with parsing of the data, you might want to make the above a single "token" whose value is the part after the = and whose type corresponds to the (case-normalized) keyword. Of course, each parameter-type may require an idiosyncratic value parser, which could only be achieved in flex by use of start conditions. In any event, you should think about the consequences of stray characters in the TITLE which are not part of the STRING pattern, and how you propose to deal with the resulting lexical error.
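A sketch of that idea with a start condition (TITLE_VALUE is a hypothetical name, and the (?i:...) case-insensitivity group requires a reasonably recent flex):
%x TITLE_VALUE
%%
(?i:TITLE)"="       { BEGIN(TITLE_VALUE); return token::K_TITLE; }
<TITLE_VALUE>[^\n]* { *m_yylval = std::string(yytext, yyleng); /* the value: everything after '=' */
                      BEGIN(INITIAL);
                      return token::V_STRING; }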
Your code does not make it clear how you communicate text values from your lexer to your parser. You need to be aware that the value of yytext is only safe inside of the lexer action for the token it corresponds to. The next call to the lexer will invalidate it, and bison parsers almost always have a lookahead token, so the lexer will have been called again before the token is processed. Consequently, you must copy yytext in order to pass it to the parser, and the parser needs to take ownership of the copy so that you don't end up leaking memory.
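Concretely, a sketch (assuming a C++ setup where m_yylval points at a std::string owned by the parser, so assignment makes a copy; the question doesn't show this part):
[A-Za-z][^\n[:space:]+-=,]* {
    /* copy now: yytext is only valid inside this action */
    *m_yylval = std::string(yytext, yyleng);
    return token::V_STRING;
}
With a plain C parser you would strdup(yytext) instead, and the parser would have to free it.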

OCamllex matching beginning of line?

I am messing around writing a toy programming language in OCaml with ocamllex, and was trying to make the language sensitive to indentation changes, Python-style, but am having a problem matching the beginning of a line with ocamllex's regex rules. I am used to using ^ to match the beginning of a line, but in OCaml that is the string concatenation operator. Google searches haven't been turning up much for me, unfortunately :( Anyone know how this would work?
I'm not sure if there is explicit support for zero-length matching symbols (like ^ in Perl-style regular expressions, which matches a position rather than a substring). However, you should be able to let your lexer turn newlines into an explicit token, something like this:
parser.mly
%token EOL
%token <int> EOLWS
% other stuff here
%%
main:
EOL stmt { MyStmtDataType(0, $2) }
| EOLWS stmt { MyStmtDataType($1 - 1, $2) }
;
lexer.mll
{
open Parser
exception Eof
}
rule token = parse
[' ' '\t'] { token lexbuf } (* skip other blanks *)
| ['\n'][' ']+ as lxm { EOLWS(String.length(lxm)) }
| ['\n'] { EOL }
(* ... *)
This is untested, but the general idea is:
Treat newlines as statement 'starters'
Measure whitespace that immediately follows the newline and pass its length as an int
Caveat: you will need to preprocess your input to start with a single \n if it doesn't contain one.