How to create a regex without certain group of letters in lex - regex

I've recently started learning lex, so for practice I decided to write a program that recognises a declaration of a normal variable (sort of).
This is my code:
%{
#include "stdio.h"
%}
dataType "int"|"float"|"char"|"String"
alphaNumeric [_\*a-zA-Z][0-9]*
space [ ]
variable {dataType}{space}{alphaNumeric}+
%option noyywrap
%%
{variable} printf("ok");
. printf("incorect");
%%
int main(){
yylex();
}
Some cases where the output should be ok:
int var3
int _varR3
int _AA3_
But if I type int float as input, it also prints ok, which is wrong because both are reserved words.
So my question is: what should I modify so that my expression rejects the 'dataType' words after the space?
Thank you.

A preliminary consideration: Typically, detection of the construction you point out is not done at the lexing phase, but at the parsing phase. On yacc/bison, for instance, you would have a rule that only matches a "type" token followed by an "identifier" token.
To achieve that with lex/flex though, you could consider playing around with the negation (^) and trailing context (/) operators. Or...
If you're running flex, perhaps simply surrounding all your regexes with parentheses and passing the -l flag would do the trick. Note that there are a few differences between lex and flex, as described in the Flex manual.

This is really not the way to solve this particular problem.
The usual way of doing it would be to write separate pattern rules to recognize keywords and variable names. (Plus a pattern rule to ignore whitespace.) That means that the tokenizer will return two tokens for the input int var3. Recognizing that the two tokens are a valid declaration is the responsibility of the parser, which will repeatedly call the tokenizer in order to parse the token stream.
However, if you really want to recognize two words as a single token, it is certainly possible. (F)lex does not allow negative lookaheads in regular expressions, but you can use the pattern matching precedence rule to capture erroneous tokens.
For example, you could do something like this:
dataType int|float|char|String
id [[:alpha:]_][[:alnum:]_]*
%%
{dataType}[[:space:]]+{dataType} { puts("Error: two types"); }
{dataType}[[:space:]]+{id} { puts("Valid declaration"); }
/* ... more rules ... */
The above uses POSIX character classes instead of writing out the possible characters. See man isalpha for a list of POSIX character classes; the character class component [:xxxxx:] contains exactly the characters accepted by the isxxxxx standard library function. I fixed the pattern so that it allows more than one space between the dataType and the id, and simplified the pattern for ids.
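To see why the order of the two rules settles the int float case, the "longest match, then first rule" policy can be emulated outside flex. The following Python sketch is not flex and makes no claim about how flex is implemented; the RULES table and the classify helper are names made up for this illustration, and \s stands in for [[:space:]].

```python
import re

DATA_TYPE = r"(?:int|float|char|String)"
IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

# Same order as the flex rules above: the error rule comes first.
RULES = [
    ("Error: two types", re.compile(DATA_TYPE + r"\s+" + DATA_TYPE)),
    ("Valid declaration", re.compile(DATA_TYPE + r"\s+" + IDENT)),
]


def classify(text):
    """Return the label of the rule that wins at the start of text, or None."""
    best_label, best_len = None, 0
    for label, pattern in RULES:
        m = pattern.match(text)
        # A strictly longer match wins; on a tie the earlier rule is kept.
        if m and m.end() > best_len:
            best_label, best_len = label, m.end()
    return best_label
```

Under these assumptions, classify("int float") picks the error rule because both rules match nine characters and the error rule is listed first, while classify("int floaty") picks the declaration rule because its match is one character longer.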

Related

Regular expression to count ALL newline characters in C++

I am trying to write a rules.l file to generate flex to read through any given input and print out every possible thing (for example - string, int, +, -, if, else, etc), along with its length, token, and what line it is on. Everything works as it should, except that it is not counting newline characters within a string literal.
I have googled my heart out and read all kinds of things, and they all say that just using the expression \n should allow me to count every newline in the text.
I also use [ \t] to eat whitespace.
My output should say:
< line: 14, lexeme: |"last"|
but instead it says:
> line: 10, lexeme: |"last"|
Any input/advice would be greatly appreciated!
Here is a bit of my .l file for context:
%option noyywrap
%{
int line_number = 1;
%}
%%
if { return TOK_IF; }
else { return TOK_ELSE; }
.
.
.
[a-zA-Z]([a-zA-Z]|[0-9]|"_")* { return TOK_IDENTIFIER; }
\"(\\.|[^"\\])*\" { return TOK_STRINGLIT; }
[ \t]+ ;
[\n] {++line_number;}
I'd suggest you add
%option yylineno
to your Flex file, and then use the yylineno variable instead of trying to count newlines yourself. Flex gets the value right, and usually manages to optimise the computation.
That said, a pattern such as \"([^"])*\" is not a good way to read string literals, because it will terminate at the first quote, even an escaped one. That will fail disastrously if the string literal is "\"Bother,\" he said. \"It's too short.\""
Here's a better one:
\"(\\(.|\n)|[^\\"\n])*\"
(That will not match string literals which include unescaped newline characters; in C++, those are not legal. But you'll need to add another rule to match the erroneous string and produce an appropriate error message.)
I suppose it is possible that you must conform to the artificial requirements of a course designed by someone unaware of the yylineno feature. In that case, the simple solution of adding line_number = yylineno; at the beginning of every rule would probably be considered cheating.
What you will need to do is what flex itself does (but it doesn't make mistakes, and we programmers do): figure out which rules might match text including one or more newlines, and insert code in those specific rules to count the newlines matched. Typically, the rules in question are multi-line comments and string literals themselves (since the string literal might include a backslash line continuation.)
One way to figure out which rules might be matching newlines is to turn on the yylineno feature, and then examine the code generated by flex. Search for YY_RULE_SETUP in that file; the handler for every rule (including the ones whose action does nothing) starts with that macro invocation. If you have enabled %option yylineno, flex figures out which rules might match a newline, and inserts code before YY_RULE_SETUP to fix yylineno. These rules start with the comment /* rule N can match eol */, where N is the index of the rule. You'll need to count rules in your source file to match N with the line number. Or you can look for the #line directive in the generated code.
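The manual approach described above (count the newlines inside every lexeme that can contain one) can be sketched outside flex. This Python emulation is only an illustration: TOKEN_RE and tokenize are hypothetical names, the string pattern mirrors the improved one given earlier, and Python tries alternatives in order rather than by longest match, which happens not to matter here because no two alternatives can start with the same character.

```python
import re

# Hypothetical rule table; STRINGLIT mirrors \"(\\(.|\n)|[^\\"\n])*\"
TOKEN_RE = re.compile(r'''
    (?P<STRINGLIT>"(?:\\(?:.|\n)|[^\\"\n])*")
  | (?P<IDENTIFIER>[A-Za-z][A-Za-z0-9_]*)
  | (?P<NEWLINE>\n)
  | (?P<SKIP>[ \t]+)
''', re.VERBOSE)


def tokenize(text):
    """Return (line, kind, lexeme) triples; line is where the token starts."""
    line, pos, tokens = 1, 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            pos += 1  # skip a character that no rule recognises
            continue
        kind, lexeme = m.lastgroup, m.group()
        if kind in ("STRINGLIT", "IDENTIFIER"):
            tokens.append((line, kind, lexeme))
        # Count newlines in every lexeme, not only in the NEWLINE rule,
        # so a backslash-newline inside a string literal is not missed.
        line += lexeme.count("\n")
        pos = m.end()
    return tokens
```

In a flex action the equivalent step would be a loop over yytext that increments line_number for each newline found, placed in the string-literal and comment rules.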

Regex Beginner; Writing Lex/Flex compatible Regular Expressions (specifically to identify even integers)

So I am looking to write a simple flex program in which I use a regular expression to identify integers (separating them from any "whitespace"). I then use C code blocks to increment integerCount and evenCount (both initialised to 0). I am completely new to both flex and regular expressions. I am using the book Flex/Bison by O'Reilly Media as a reference for writing flex programs. Because I am unfamiliar with regular expressions in general, I resorted to Google for read-ups, which led me to the following websites:
Regexr.com helped me understand regular expressions better, as I was able to toy around with them and see in real time the changes I was making. The problem is that I was able to write the regex I wanted on the website (I am going to put it at the bottom of the page so it is formatted better); however, it does not function as intended within flex. This led me to realise that flex does not use the same notation/rules for regular expressions that I am used to.
This site compares the rules of regular expressions in Perl, grep and lex. As you can see, many of the features I used to build my regular expression aren't compatible with lex. As I understand it, I am not working with whitespace per se, but with ASCII space, carriage return, etc.
Below is the regular expression I created on Regexr.com to identify stray even integers.
\d+[02468]+((\n)|(\s)|($)){1}
As this is not compatible with flex, I had to make some changes. I can figure out how to swap \d with [0-9]; however, swapping \n with carriage return \x0D and \s with space \x0 doesn't seem to be the right approach.
I am using flex to compile the program to lex.yy.c , and calling "cc lex.yy.c -lfl" to compile it to the a.out executable program. This works only on Linux and not OSX.
Here is a link to my solution.l program at the moment.
If you have any advice for me, I would really appreciate your guidance. In any case, thank you for reading.
To match integers, you simply need:
[[:digit:]]+ { /* handle a number */ }
If you want to match even integers, you can use
[[:digit:]]*[02468] { /* handle an even number */ }
If you want to match both even and odd integers, doing something different for each parity, you would use two rules:
[[:digit:]]*[02468] { /* handle an even number */ }
[[:digit:]]*[13579] { /* handle an odd number */ }
Or you could do it with the first two patterns, as long as you put them in the right order:
[[:digit:]]*[02468] { /* handle an even number */ }
[[:digit:]]+ { /* handle any other number */ }
This works because (f)lex always uses the first rule if two patterns are equally good.
There's no point in trying to match whitespace or newlines as part of the number. They're not part of the number, and the idea of flex is that you are breaking the input up into meaningful pieces ("tokens"). It might be that you don't care about other pieces of the input, but you still need to recognise them, if only to explicitly ignore them. For example, to ignore anything which is not a digit, you could add the following rule:
[^[:digit:]] ; /* Do nothing*/
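The semicolon is an empty C statement that serves as the rule's action. Flex also accepts a rule with no action at all and simply discards the match, but writing the semicolon makes the intent explicit.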
Now, it's possible that your intention was to not count numbers in the middle of a word, like F29 or 23skidoo. In that case, you would want to add another pattern which recognises those strings which might contain numbers. Then you will probably have to recognise whitespace explicitly, rather than lumping it in with "not a digit". Surprisingly, this is pretty simple:
[[:digit:]]*[02468] { /* handle an even number */ }
[[:digit:]]+ { /* handle any other number */ }
[[:space:]]+ ; /* Ignore whitespace */
[^[:space:]]+ ; /* Ignore everything else */
The last pattern might need some explanation, since a number is also a sequence of non-whitespace characters. But it works for the same reason we don't need an explicit match for odd numbers; the "maximal munch" rule says that (f)lex always uses the pattern with the longest match, and if there is more than one pattern tied for longest match, it uses the first one. In other words, if a sequence of characters delimited by whitespace happens to be a number, one of the number rules will be chosen rather than the last rule. On the other hand, if a number is immediately followed by garbage, the last rule will be used because it has a longer match.
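The interplay of the four rules can be checked with a small emulation of the maximal munch policy. This is a Python sketch rather than flex; the RULES table and count_integers are names invented for the illustration, [0-9] replaces [[:digit:]], and \s and \S replace [[:space:]] and [^[:space:]].

```python
import re

# Same order as the four flex rules above.
RULES = [
    ("even", re.compile(r"[0-9]*[02468]")),
    ("number", re.compile(r"[0-9]+")),
    ("space", re.compile(r"\s+")),
    ("other", re.compile(r"\S+")),
]


def count_integers(text):
    """Return (integer_count, even_count) for whitespace-delimited integers."""
    integer_count = even_count = 0
    pos = 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind in ("even", "number"):
            integer_count += 1
        if best_kind == "even":
            even_count += 1
        # Every character is whitespace or not, so some rule always advances.
        pos = best_end
    return integer_count, even_count
```

With the input "12 7 F29 23skidoo 40", the words F29 and 23skidoo fall through to the last rule because it produces the longest match, so only 12, 7 and 40 are counted.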

How to specify a specific string in Regex

I'm tinkering around with flex and bison to create a small calculator program. The input will be something like this:
read A
read B
sum := A + B
write sum
read and write are keywords indicating reading a value in or writing a value to the output. ":=" is the assignment operator. A and B are identifiers, which can be strings. There will also be line comments (//comment) and block comments (/* asdfsd */).
Would these regular expressions be correct for the little grammar I describe?
[:][=] //assignment operator
[ \t] //skipping whitespace
[a-zA-Z0-9]+ //identifiers
[Rr][Ee][Aa][Dd] //read symbols, not case-sensitive
[/][/] //comment
For the assignment operator and the comment regex, can I just do this instead? Would flex and bison accept it?
":=" //assignment operator
"//" //comment
Yes, ":=" and "//" will work, though the comment rule should really be "//".* because you want to skip everything after the // (until the end of the line). If you just match "//", flex will try to tokenize what comes after it, which you don't want because a comment doesn't have to consist of valid tokens (and even if it did, those tokens should not be seen by the parser).
Further [Rr][Ee][Aa][Dd] should be placed before the identifier rule. Otherwise it will never be matched (because if two rules can match the same lexeme, flex will pick the one that comes first in the file). It can also be written more succinctly as (?i:read) or you can enable case insensitivity globally with %option caseless and just write read.
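The effect of rule order on keywords, and of matching the whole comment with "//".*, can be tried out in a small emulation. The Python sketch below is not flex: the rule names, the tokenize helper, and the WRITE and PLUS rules are assumptions added for the illustration, and re.IGNORECASE plays the role of (?i:...).

```python
import re

# Keyword rules are listed before IDENT, as recommended above.
RULES = [
    ("COMMENT", re.compile(r"//.*")),  # '.' stops at the end of the line
    ("ASSIGN", re.compile(r":=")),
    ("READ", re.compile(r"read", re.IGNORECASE)),
    ("WRITE", re.compile(r"write", re.IGNORECASE)),
    ("IDENT", re.compile(r"[a-zA-Z0-9]+")),
    ("PLUS", re.compile(r"\+")),
    ("SPACE", re.compile(r"[ \t\n]+")),
]


def tokenize(text):
    """Return (kind, lexeme) pairs, dropping comments and whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise ValueError(f"unexpected character {text[pos]!r}")
        if best_kind not in ("COMMENT", "SPACE"):
            tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens
```

Here "Read" becomes a READ token because the keyword rule ties with IDENT and comes first, whereas "reader" stays an identifier because IDENT matches more characters.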
You can start with the following (with the ignore-case option):
(read|write)\s+[a-z]+ will match a read/write expression;
[a-z]+\s:=[a-z+\/* -]* will match an assignment with simple arithmetic;
\/\/.* will match an inline comment;
\/\*[\s\S]*\*\/ will match multi-line comments.
Keep in mind that these are basic regexes and may not fit more complex syntaxes. They are also written in PCRE-style syntax; flex does not support escapes such as \s, so they would need to be adapted before use in a .l file.
You can try them out with Regex101.com, for example.

ANTLR (ANTLR3) 2 cases or anything else pattern [duplicate]

How to match any symbol in ANTLR parser (not lexer)? Where is the complete language description for ANTLR4 parsers?
UPDATE
Is the answer "impossible"?
You first need to understand the roles of each part in parsing:
The lexer: this is the object that tokenizes your input string. Tokenizing means to convert a stream of input characters to an abstract token symbol (usually just a number).
The parser: this is the object that only works with tokens to determine the structure of a language. A language (written as one or more grammar files) defines the token combinations that are valid.
As you can see, the parser doesn't even know what a letter is. It only knows tokens. So your question is already wrong.
Having said that it would probably help to know why you want to skip individual input letters in your parser. Looks like your base concept needs adjustments.
It depends what you mean by "symbol". To match any token inside a parser rule, use the . (DOT) meta char. If you're trying to match any character inside a parser rule, then you're out of luck, there is a strict separation between parser- and lexer rules in ANTLR. It is not possible to match any character inside a parser rule.
It is possible, but only if you have such a basic grammar that the reason to use ANTLR is negated anyway.
If you had the grammar:
text : ANY_CHAR* ;
ANY_CHAR : . ;
it would do what you (seem to) want.
However, as many have pointed out, this would be a pretty strange thing to do. The purpose of the lexer is to identify different tokens that can be strung together in the parser to form a grammar, so your lexer can either identify the specific string "JSTL/EL" as a token, or [A-Z]'/EL', [A-Z]'/'[A-Z][A-Z], etc - depending on what you need.
The parser is then used to define the grammar, so:
phrase : CHAR* jstl CHAR* ;
jstl : JSTL SLASH QUALIFIER ;
JSTL : 'JSTL' ;
SLASH : '/' ;
QUALIFIER : [A-Z][A-Z] ;
CHAR : . ;
would accept "blah blah JSTL/EL..." as input, but not "blah blah EL/JSTL...".
I'd recommend looking at The Definitive ANTLR 4 Reference, in particular the section on "Islands in the stream" and the Grammar Reference (Ch 15) that specifically deals with Unicode.

Is it bad idea using regex to tokenize string for lexer?

I'm not sure how I should tokenize source code for a lexer. For now, I can only think of using regexes to split the string into an array according to given rules (identifiers, symbols such as +, -, etc.).
For instance,
begin x:=1;y:=2;
then I want to tokenize word, variable (x, y in this case) and each symbol (:,=,;).
Using regexes is a common way of implementing a lexer. If you don't want to use them then you'll sort of end up implementing some regex parts yourself anyway.
Although performance-wise it can be more efficient if you do it yourself, it isn't a must.
Using regular expressions is THE traditional way to generate your tokens.
lex and yacc (or flex and bison) are a traditional compiler creation pair, where lex does nothing except tokenize symbols and pass them to yacc.
http://en.wikipedia.org/wiki/Lex_%28software%29
yacc is a stack-based state machine (pushdown automaton) that processes the symbols.
I think regex processing is the way to go for parsing symbols of any level of complexity. As Oak mentions, you'll end up writing your own (probably inferior) regex parser. The only exception would be if it is dead simple, and even your posted example starts to exceed "dead simple".
In lex syntax:
:= return ASSIGN_TOKEN_OR_WHATEVER;
begin return BEGIN_TOKEN;
[0-9]+ return NUMBER;
[a-zA-Z][a-zA-Z0-9]* return WORD;
Character sequences are optionally passed along with the token.
Individual characters that are tokens in their own right (e.g. ";") get passed along unmodified. It's not the only way, but I have found it to work very well.
Have a look:
http://www.faqs.org/docs/Linux-HOWTO/Lex-YACC-HOWTO.html
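For comparison, the same rule set can be written directly against a general-purpose regex library. This is a minimal Python sketch, assuming the token names from the lex fragment above (shortened to ASSIGN and BEGIN); TOKEN_RE, tokenize, SKIP and CHAR are names made up for the illustration. Unlike lex, Python tries alternatives in order instead of picking the longest match, so the keyword needs a \b guard to avoid swallowing the front of a longer word.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<ASSIGN>:=)
  | (?P<BEGIN>begin\b)               # \b keeps 'beginner' from matching here
  | (?P<NUMBER>[0-9]+)
  | (?P<WORD>[a-zA-Z][a-zA-Z0-9]*)
  | (?P<SKIP>\s+)
  | (?P<CHAR>.)                      # any other single character, e.g. ';'
""", re.VERBOSE)


def tokenize(source):
    """Return (kind, lexeme) pairs for source, dropping whitespace."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

Calling tokenize("begin x:=1;y:=2;") yields a BEGIN token followed by WORD, ASSIGN, NUMBER and CHAR tokens for each of the two assignments.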