The Regex Rule of Flex - regex

I am confused about a rule in my flex lexer.
My lexer can recognize decimal and hex numbers separately, but it fails when I try to combine the two definitions into a single integer definition.
flex reports: test.l:13: unrecognized rule
Here's my lexer file:
test.l
%{
#include <stdio.h>
#include <string.h>
int yylval;
%}
digit [0-9]
decimal ^({digit}|[1-9]{digit}+)$
hex 0[xX][0-9a-fA-F]+
integer {hex}|{decimal}
%%
{integer} {printf("integer - %s \n", yytext);}
%%
// run function
int yywrap(void) {
    return 1;
}
int main(void) {
    yylex();
    return 0;
}

Why do you think you need to anchor your decimal pattern? The way it is written, it will only match a number which is by itself on a line, without even any white space.
Anyway, it's the anchor which creates a problem. In (f)lex, ^ can only appear at the beginning of a pattern, and the macro expansion of {hex}|{decimal} has the ^ in the middle.
Changing it to {decimal}|{hex} won't help because flex normally surrounds macro expansions with parentheses to avoid incorrect operator grouping. (The parentheses are not inserted if the macro ends with $, but the immediate replacement body of {integer} doesn't end with a $.)
That effectively makes it impossible to use ^ anchors in macros, and hard to use $. You probably don't need these anchors at all, so the easiest solution is likely to just get rid of them. But if you do need to anchor your patterns, you must do so in the rule itself, outside any macro.
You might also consider not relying on flex macros. Like C macros, they are not nearly as useful as they might at first appear. If you want meaningful names for character ranges, you'll find that flex already provides them: [[:digit:]] is [0-9], [[:xdigit:]] is [0-9a-fA-F], and so on (the same categories as provided by C's <ctype.h> header).
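As a rough analogy in plain C (using POSIX <regex.h>, not flex itself), here is a sketch of the "anchor at the point of use, not in the definition" idea. The macro names DECIMAL, HEX and INTEGER and the function is_integer_lexeme are made up for this illustration. Each building block carries its own parentheses, much as flex adds them when it expands a definition, and the anchors appear only once, where the pieces are combined.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical building blocks: no anchors, and each one is wrapped in
 * parentheses so that the alternation inside cannot leak out. */
#define DECIMAL "([[:digit:]]|[1-9][[:digit:]]+)"
#define HEX     "(0[xX][[:xdigit:]]+)"

/* The anchors are written once, at the point of use. */
#define INTEGER "^(" HEX "|" DECIMAL ")$"

/* Returns true if the whole string s is a decimal or hex integer. */
bool is_integer_lexeme(const char *s)
{
    regex_t re;
    if (regcomp(&re, INTEGER, REG_EXTENDED | REG_NOSUB) != 0)
        return false;               /* pattern failed to compile */
    bool ok = regexec(&re, s, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

In the flex file the corresponding change would be to drop ^ and $ from the decimal definition and, only if they are really needed, write them in the rule itself around {integer}.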

Related

Regular expression to count ALL newline characters in C++

I am trying to write a rules.l file to generate flex to read through any given input and print out every possible thing (for example - string, int, +, -, if, else, etc), along with its length, token, and what line it is on. Everything works as it should, except that it is not counting newline characters within a string literal.
I have googled my heart out and read all kinds of things, and they all say that just using the expression \n should allow me to count every newline in the text.
I also use [ \t] to eat whitespace.
My output should say:
< line: 14, lexeme: |"last"|
but instead it says:
> line: 10, lexeme: |"last"|
Any input/advice would be greatly appreciated!
Here is a bit of my .l file for context:
%option noyywrap
%{
int line_number = 1;
%}
%%
if { return TOK_IF; }
else { return TOK_ELSE; }
.
.
.
[a-zA-Z]([a-zA-Z]|[0-9]|"_")* { return TOK_IDENTIFIER; }
\"(\\.|[^"\\])*\" { return TOK_STRINGLIT; }
[ \t]+ ;
[\n] {++line_number;}
I'd suggest you add
%option yylineno
to your Flex file, and then use the yylineno variable instead of trying to count newlines yourself. Flex gets the value right, and usually manages to optimise the computation.
That said, \"([^"])*\" is not the optimal way to read string literals, because it will terminate at the first quote. That will fail disastrously if the string literal is "\"Bother,\" he said. \"It's too short.\""
Here's a better one:
\"(\\(.|\n)|[^\\"\n])*\"
(That will not match string literals which include unescaped newline characters; in C++, those are not legal. But you'll need to add another rule to match the erroneous string and produce an appropriate error message.)
I suppose it is possible that you must conform to the artificial requirements of a course designed by someone unaware of the yylineno feature. In that case, the simple solution of adding line_number = yylineno; at the beginning of every rule would probably be considered cheating.
What you will need to do is what flex itself does (but it doesn't make mistakes, and we programmers do): figure out which rules might match text including one or more newlines, and insert code in those specific rules to count the newlines matched. Typically, the rules in question are multi-line comments and string literals themselves (since the string literal might include a backslash line continuation.)
One way to figure out which rules might be matching newlines is to turn on the yylineno feature, and then examine the code generated by flex. Search for YY_RULE_SETUP in that file; the handler for every rule (including the ones whose action does nothing) starts with that macro invocation. If you have enabled %option yylineno, flex figures out which rules might match a newline, and inserts code before YY_RULE_SETUP to fix yylineno. These rules start with the comment /* rule N can match eol */, where N is the index of the rule. You'll need to count rules in your source file to match N with the corresponding rule. Or you can look for the #line directive in the generated code.
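If you do end up counting by hand, the per-rule code can be as small as the helper below. This is only a sketch; count_newlines is a name invented here, not something flex provides.

```c
#include <stddef.h>

/* Counts the newline characters in the first len bytes of text. */
size_t count_newlines(const char *text, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; ++i)
        if (text[i] == '\n')
            ++n;
    return n;
}
```

In the action of each rule that can match a newline (the string literal and multi-line comment rules, typically), you would then write something like line_number += count_newlines(yytext, (size_t)yyleng); before returning the token.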

How can I use regular expressions to match a 'broken' string, or a proper string?

What I mean is that I need a regular expression that can match either something like this...
"I am a sentence."
or something like this...
"I am a sentence.
(notice the missing quotation mark at the end of the second one). My attempt at this so far is
["](\\.|[^"])*["]*
but that isn't working. Thanks for the help!
Edit for clarity: I am intending for this to be something like a C style string. I want functionality that will match with a string even if the string is not closed properly.
You could write the pattern as:
["](\\.|[^"\n])*["]?
which only has two small changes:
It excludes newline characters inside the string, so that the invalid string will only match to the end of the line. (. does not match newline, but a negated character class does, unless of course the newline is explicitly negated.)
It makes the closing double quote optional rather than arbitrarily repeated.
However, it is hard to imagine a use case in which you just want to silently ignore the error. So I would recommend writing two rules:
["](\\.|[^"\n])*["] { /* valid string */ }
["](\\.|[^"\n])* { /* invalid string */ }
Note that the first pattern is guaranteed to match a valid string because it will match one more character than the other pattern and (f)lex always goes with the longer match.
Also, writing two overlapping rules like that does not cause any execution overhead, because of the way (f)lex compiles the patterns. In effect, the common prefix is automatically factored out.
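To see the longest-match reasoning outside flex, here is a small C sketch using POSIX <regex.h>, which also prefers the longest match. The names string_kind, match_length and classify_string are invented for this example, and the patterns are my POSIX spelling of the two rules above (\\[^\n] stands in for \\. because POSIX's . would otherwise match a newline).

```c
#include <regex.h>
#include <stddef.h>

enum string_kind { NOT_A_STRING, VALID_STRING, UNTERMINATED_STRING };

/* Length of the match of pattern at the start of input, or 0 if none. */
static size_t match_length(const char *pattern, const char *input)
{
    regex_t re;
    regmatch_t m;
    size_t len = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;                   /* pattern failed to compile */
    if (regexec(&re, input, 1, &m, 0) == 0)
        len = (size_t)m.rm_eo;      /* pattern is anchored, so rm_so is 0 */
    regfree(&re);
    return len;
}

enum string_kind classify_string(const char *input)
{
    /* The C escape \n puts a real newline inside the bracket expression. */
    size_t closed = match_length("^\"(\\\\[^\n]|[^\"\n])*\"", input);
    size_t open   = match_length("^\"(\\\\[^\n]|[^\"\n])*", input);

    if (closed == 0 && open == 0)
        return NOT_A_STRING;
    /* The longer match wins; on a tie the rule listed first wins. */
    return closed >= open ? VALID_STRING : UNTERMINATED_STRING;
}
```

Unlike flex, this compiles and runs the two patterns separately, so it only imitates the rule selection; it says nothing about how flex merges the common prefix.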

How to specify a specific string in Regex

I'm tinkering around with flex and bison to create a small calculator program. The input will look something like this:
read A
read B
sum := A + B
write sum
read and write are keywords that indicate reading a value in or writing a value to the output. ":=" is the assignment operator. A and B are identifiers, which can be strings. There will also be line comments (//comment) and block comments (/* asdfsd */).
Would these regular expressions be correct for the little grammar I describe?
[:][=] //assignment operator
[ \t] //skipping whitespace
[a-zA-Z0-9]+ //identifiers
[Rr][Ee][Aa][Dd] //read symbols, not case-sensitive
[/][/] //comment
For the assignment operator and the comment regex, can I just do this instead? would flex and bison accept it?
":=" //assignment operator
"//" //comment
Yes, ":=" and "//" will work, though the comment rule should really be "//".* because you want to skip everything after the // (until the end of line). If you just match "//", flex will try to tokenize what comes after it, which you don't want because a comment doesn't have to consist of valid tokens (and even if it did, those tokens should not be seen by the parser).
Further [Rr][Ee][Aa][Dd] should be placed before the identifier rule. Otherwise it will never be matched (because if two rules can match the same lexeme, flex will pick the one that comes first in the file). It can also be written more succinctly as (?i:read) or you can enable case insensitivity globally with %option caseless and just write read.
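The ordering point can be mimicked in plain C. This is just a sketch: the token names and classify_word are hypothetical, and strcasecmp comes from POSIX <strings.h>. The keyword checks run before the identifier check, so a word that fits both is reported as a keyword.

```c
#include <ctype.h>
#include <strings.h>   /* strcasecmp (POSIX) */

enum token { TOK_READ, TOK_WRITE, TOK_IDENTIFIER, TOK_UNKNOWN };

/* Classifies one complete word the way an ordered rule list would:
 * the keyword checks come first, so "read" is never an identifier. */
enum token classify_word(const char *word)
{
    if (strcasecmp(word, "read") == 0)
        return TOK_READ;
    if (strcasecmp(word, "write") == 0)
        return TOK_WRITE;

    /* [a-zA-Z0-9]+ from the question (in the default C locale) */
    if (*word == '\0')
        return TOK_UNKNOWN;
    for (const char *p = word; *p != '\0'; ++p)
        if (!isalnum((unsigned char)*p))
            return TOK_UNKNOWN;
    return TOK_IDENTIFIER;
}
```

Note that a word such as reader still comes out as an identifier. In flex that happens because the longer match is preferred before rule order is considered; here it happens because the whole word is compared.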
You can start with the following (with the ignore-case option):
(read|write)\s+[a-z]+ will match a read/write expression;
[a-z]+\s:=[a-z+\/* -]* will match an assignment with simple arithmetic;
\/\/.* will match an inline comment;
\/\*[\s\S]*\*\/ will match multi-line comments.
Keep in mind that these are basic regexes and may not be suitable for more complex syntax.
You can try it with Regex101.com for example

How to create a regex without certain group of letters in lex

I've recently started learning lex, so I was practicing and decided to make a program which recognises the declaration of a normal variable (sort of).
This is my code :
%{
#include "stdio.h"
%}
dataType "int"|"float"|"char"|"String"
alphaNumeric [_\*a-zA-Z][0-9]*
space [ ]
variable {dataType}{space}{alphaNumeric}+
%option noyywrap
%%
{variable} printf("ok");
. printf("incorect");
%%
int main(){
yylex();
}
Some cases when the output should return ok
int var3
int _varR3
int _AA3_
And if I type int float as input, it returns ok, which is wrong because they are both reserved words.
So my question is: what should I modify to make my expression ignore the 'dataType' words after the space?
Thank you.
A preliminary consideration: Typically, detection of the construction you point out is not done at the lexing phase, but at the parsing phase. On yacc/bison, for instance, you would have a rule that only matches a "type" token followed by an "identifier" token.
To achieve that with lex/flex though, you could consider playing around with the negation (^) and trailing context (/) operators. Or...
If you're running flex, perhaps simply surrounding all your regexes with parentheses and passing the -l flag would do the trick. Note that there are a few differences between lex and flex, as described in the Flex manual.
This is really not the way to solve this particular problem.
The usual way of doing it would be to write separate pattern rules to recognize keywords and variable names. (Plus a pattern rule to ignore whitespace.) That means that the tokenizer will return two tokens for the input int var3. Recognizing that the two tokens are a valid declaration is the responsibility of the parser, which will repeatedly call the tokenizer in order to parse the token stream.
However, if you really want to recognize two words as a single token, it is certainly possible. (F)lex does not allow negative lookaheads in regular expressions, but you can use the pattern matching precedence rule to capture erroneous tokens.
For example, you could do something like this:
dataType int|float|char|String
id [[:alpha:]_][[:alnum:]_]*
%%
{dataType}[[:space:]]+{dataType} { puts("Error: two types"); }
{dataType}[[:space:]]+{id} { puts("Valid declaration"); }
  /* ... more rules ... */
The above uses POSIX character classes instead of writing out the possible characters. See man isalpha for a list of POSIX character classes; the character class component [:xxxxx:] contains exactly the characters accepted by the isxxxxx standard library function. I fixed the pattern so that it allows more than one space between the dataType and the id, and simplified the pattern for ids.
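Since the character classes correspond to the <ctype.h> functions, the same checks can be sketched directly in C. The function names below are invented for this illustration; the point is only that the "two types" test runs before the identifier test, just as the error rule is listed before the valid one.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Mirrors dataType: int|float|char|String */
static bool is_data_type(const char *word)
{
    static const char *const types[] = { "int", "float", "char", "String" };
    for (size_t i = 0; i < sizeof types / sizeof types[0]; ++i)
        if (strcmp(word, types[i]) == 0)
            return true;
    return false;
}

/* Mirrors id: [[:alpha:]_][[:alnum:]_]* */
static bool is_identifier(const char *word)
{
    if (!(isalpha((unsigned char)word[0]) || word[0] == '_'))
        return false;
    for (const char *p = word + 1; *p != '\0'; ++p)
        if (!(isalnum((unsigned char)*p) || *p == '_'))
            return false;
    return true;
}

/* Judges a (type, name) pair of words that were already split on
 * whitespace; the error case is tested first. */
const char *check_declaration(const char *type, const char *name)
{
    if (!is_data_type(type))
        return "Not a declaration";
    if (is_data_type(name))
        return "Error: two types";
    if (is_identifier(name))
        return "Valid declaration";
    return "Not a declaration";
}
```

This works on whole words, so a name like floaty is accepted as an identifier, which matches what the flex rules do through longest match.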

Regular expression is looping the expression and returning it as one result

I am writing a Perl script to generate .cpp files out of .h files, using regular expressions to find functions and then using regex again to break the results into two pieces, the return type and the function.
I created a regular expression to find the return type which almost works.
^(\s*&?\w*\s*(\<{1}.*\>{1})*\s)
Edit: I updated the regex string to one that works better, but still no change as far as this question is concerned.
This works on most cpp prototypes such as
int funky();
int funky(int something);
&int funky(int something);
&int <vector *> funky();
in these cases the regex matches
int
int
&int
&int <vector *>
Which is perfect, however in cases where there is a string that matches inside the function arguments, such as:
int <vector> funky(int <vector> i);
int <vector> funky(int <vector *> i);
int <vector> funky(const int <vector> i);
The regex matches
int <vector> funky(int <vector>
int <vector> funky(int <vector *>
int <vector> funky(const int <vector>
When I want it to return
int <vector>
int <vector>
int <vector>
And I can't figure out why it's continuing past the end of the first closing bracket '>'. I am new to regular expressions and simply can't figure this out.
Sorry if there is an answer out there for this; I searched and haven't been able to find one, probably because I don't even know what terms to look for :(.
Edit2: If this question is more complicated than I anticipated, could someone explain why it continues on past the first <.*>? I don't see why this doesn't work.
Regular expressions are great - for regular languages. However, most programming languages are not regular. Everything that has some sort of braces and recursion is a context-free language, or even context-sensitive. (If these CS terms confuse you, look them up on Wikipedia. They are useful.)
Especially C has a very complex grammar.
However, Perl's Regexes are not restricted to Regular Expressions. We can express context free grammars, i.e define a set of rules that the string must match. In each rule, we can reference other rules. Because of this, we can do recursion, and things like matching nested parens:
m{
^ (?&NESTED_PAREN) $
(?(DEFINE)
(?<NESTED_PAREN> [(] (?: [^()]+ | (?&NESTED_PAREN) )* [)] )
)
}x;
This regex defines a top rule: The whole string from beginning to end has to be a nested paren. Then follows a DEFINE block. We define one rule NESTED_PAREN, that starts with a ( and can contain any number of non-paren characters and nested parens. It is followed by a ). It has to be taken into account that it is easy to write an infinitely recursing grammar, but luckily each recursion will consume characters or fail in this example.
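For readers who don't use Perl, the same NESTED_PAREN rule can be sketched as a tiny recursive-descent check in C. This is an illustration of the grammar rule, not a translation of the regex engine's behaviour; the function names are made up. As in the regex, every recursive call first consumes a '(' so the recursion cannot loop forever.

```c
#include <stdbool.h>
#include <stddef.h>

/* NESTED_PAREN ::= '(' ( [^()]+ | NESTED_PAREN )* ')'
 * Parses one NESTED_PAREN starting at s[*pos] and, on success,
 * advances *pos past its closing ')'. */
static bool parse_nested_paren(const char *s, size_t *pos)
{
    if (s[*pos] != '(')
        return false;
    ++*pos;                                   /* consume '(' */
    while (s[*pos] != '\0' && s[*pos] != ')') {
        if (s[*pos] == '(') {
            if (!parse_nested_paren(s, pos))  /* nested rule */
                return false;
        } else {
            ++*pos;                           /* one [^()] character */
        }
    }
    if (s[*pos] != ')')
        return false;                         /* ran out of input */
    ++*pos;                                   /* consume ')' */
    return true;
}

/* Top rule: the whole string must be exactly one NESTED_PAREN. */
bool is_nested_paren(const char *s)
{
    size_t pos = 0;
    return parse_nested_paren(s, &pos) && s[pos] == '\0';
}
```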
For a nicer interface to write grammars in Perl, I highly recommend Regexp::Grammars from CPAN.
Now we know how to write grammars in Perl and can create one for your function declarations.
Here is a symbolic notation without whitespaces:
FUNCTION ::= TYPE VECTOR? NAME '(' ARGUMENTS ')'
VECTOR ::= '<' vector '*'? '>'
ARGUMENTS::= ( ARGUMENT (',' ARGUMENT)* )?
ARGUMENT ::= TYPE VECTOR? NAME
You may have noticed that we can re-use some of the rules for the function inside the argument list. Now you just have to map this grammar to a set of (DEFINE) rules, write the top-level rule and you are ready to go. Again, using Regexp::Grammars will make this job much easier, but it brings another language you will have to learn.
See perldoc perlre for the ultimate reference of built-in features in Perl regexes.
Please note that, (because of the preprocessor, among other things), the C (and C++) syntax is neither regular nor context-free. Everything short of executing the preprocessor will end up being a nice try…
Quantifiers in regular expressions are greedy by default. Put a ? after your .* to make it non-greedy, and it will stop at the first > rather than the last one.
^(\s*&?\w*\s*(\<{1}.*?\>{1})*\s)
More info at http://perldoc.perl.org/perlre.html#Regular-Expressions
Here's another way to do it:
/^\s*&?\w*(\s+\<[^\>]+\>)?/
The part in parentheses (\s+\<[^\>]+\>)? is any text starting with spaces, then a "<" followed by any characters that are NOT a ">" (negation character class [^\>]+) and then a ">".
The negation character class with ">" makes sure that the matching will end as soon as the <> part ends. Also the parentheses are followed by a "?" making it an optional part of the expression.
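The negated character class approach also works in engines that have no non-greedy quantifiers at all. As a sketch, here it is with POSIX <regex.h> in C; the function name extract_return_type is invented, and the pattern is my POSIX spelling of the regex above ([[:space:]] for \s and [[:alnum:]_] for \w).

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Returns the leading return type of decl in a static buffer
 * (overwritten on each call, not thread-safe), or NULL on failure. */
const char *extract_return_type(const char *decl)
{
    /* POSIX spelling of ^\s*&?\w*(\s+<[^>]+>)? */
    static const char pattern[] =
        "^[[:space:]]*&?[[:alnum:]_]*([[:space:]]+<[^>]+>)?";
    static char out[128];
    regex_t re;
    regmatch_t m[1];
    const char *result = NULL;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;                /* pattern failed to compile */
    if (regexec(&re, decl, 1, m, 0) == 0) {
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (len >= sizeof out)
            len = sizeof out - 1;   /* truncate overly long matches */
        memcpy(out, decl + m[0].rm_so, len);
        out[len] = '\0';
        result = out;
    }
    regfree(&re);
    return result;
}
```

Because [^>]+ can never run past a '>', the match has to stop at the first closing bracket no matter how greedy the engine is.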