Is it a bad idea to use regex to tokenize a string for a lexer?

I'm not sure how I'm going to tokenize the source for my lexer. For now, all I can think of is using regex to split the string into an array according to given rules (identifiers, symbols such as +, -, etc.).
For instance,
begin x:=1;y:=2;
then I want to tokenize the keyword, the variables (x and y in this case), and each symbol (:, =, ;).

Using regexes is a common way of implementing a lexer. If you don't want to use them, you'll end up implementing parts of a regex engine yourself anyway.
Although a hand-written lexer can be more efficient performance-wise, it isn't a must.
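For the example in the question, a rough sketch in C++ with std::regex might look like the following (the pattern and token classes are mine, just for illustration; a real lexer would also classify each match and handle errors):
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string src = "begin x:=1;y:=2;";
    // One alternative per token class: identifiers/keywords, numbers,
    // the := operator, then single-character symbols.
    std::regex token_re(R"([A-Za-z_][A-Za-z0-9_]*|[0-9]+|:=|[;+\-*/=])");
    for (std::sregex_iterator it(src.begin(), src.end(), token_re), end;
         it != end; ++it) {
        std::cout << "token: " << it->str() << "\n";
    }
}
Running this on the sample input would produce the tokens begin, x, :=, 1, ;, y, :=, 2, ; in order.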

Using regular expressions is THE traditional way to generate your tokens.
lex and yacc (or flex and bison) are the traditional compiler-construction pair, where lex does nothing except tokenize the input and pass the tokens to yacc.
http://en.wikipedia.org/wiki/Lex_%28software%29
yacc is a stack-based state machine (pushdown automaton) that processes the tokens.
I think regex processing is the way to go for parsing symbols of any level of complexity. As Oak mentions, you'll end up writing your own (probably inferior) regex parser. The only exception would be if it is dead simple, and even your posted example starts to exceed "dead simple".
In lex syntax:
:= return ASSIGN_TOKEN_OR_WHATEVER;
begin return BEGIN_TOKEN;
[0-9]+ return NUMBER;
[a-zA-Z][a-zA-Z0-9]* return WORD;
Character sequences are optionally passed along with the token.
Individual characters that are tokens in their own right (e.g. ";") get passed along unmodified. It's not the only way, but I have found it to work very well.
Have a look:
http://www.faqs.org/docs/Linux-HOWTO/Lex-YACC-HOWTO.html

Related

Group multiple regular expressions for reuse in lex

I want to use multiple regular expressions as follows (pseudo code):
[0-9]+|[0-9]+.[0-9]+ - number
+|-|*|/ - sign
[number][sign][number]=[number] - math expression
The closest thing I found was this, but the code is in JavaScript, while I want to use lex / flex for it.
Is it possible using the normal RegEx syntax?
(F)lex lets you define what are, in effect, macros. The definitions go in the definitions section (before the first %%) and the simple syntax is described in the flex manual, with examples.
So you could write
number [0-9]+|[0-9]+\.[0-9]+
sign [+*/-]
%%
{number}{sign}{number}={number} { /* do something */ }
But that would rarely be a good idea, and it is certainly not the intended use of (f)lex. Normally, you would use flex to decompose the input into a sequence of five tokens: a number, an operator, another number, an =, and a final number. You would use a parser, which repeatedly calls the flex-generated scanner, to construct an object representing the equation (or to verify the equation, if that was the intent.)
If you use the regex proposed in your question, you will almost certainly end up rescanning the matched equation in order to extract its components; it is almost always better to avoid the rescan.

How to create a regex without a certain group of letters in lex

I've recently started learning lex, so I was practicing and decided to make a program which recognises a declaration of a normal variable. (Sort of)
This is my code:
%{
#include "stdio.h"
%}
dataType "int"|"float"|"char"|"String"
alphaNumeric [_\*a-zA-Z][0-9]*
space [ ]
variable {dataType}{space}{alphaNumeric}+
%option noyywrap
%%
{variable} printf("ok");
. printf("incorect");
%%
int main(){
yylex();
}
Some cases where the output should return ok:
int var3
int _varR3
int _AA3_
And if I type as input: int float, it returns ok, which is wrong because they are both reserved words.
So my question is: what should I modify to make my expression reject 'dataType' words after the space?
Thank you.
A preliminary consideration: typically, detection of the construction you point out is not done at the lexing phase, but at the parsing phase. In yacc/bison, for instance, you would have a rule that only matches a "type" token followed by an "identifier" token.
To achieve that with lex/flex, though, you could consider playing around with the negation (^) and trailing context (/) operators. Or...
If you're running flex, perhaps simply surrounding all your regexes with parentheses and passing the -l flag would do the trick. Notice there are a few differences between lex and flex, as described in the Flex manual.
This is really not the way to solve this particular problem.
The usual way of doing it would be to write separate pattern rules to recognize keywords and variable names. (Plus a pattern rule to ignore whitespace.) That means that the tokenizer will return two tokens for the input int var3. Recognizing that the two tokens are a valid declaration is the responsibility of the parser, which will repeatedly call the tokenizer in order to parse the token stream.
However, if you really want to recognize two words as a single token, it is certainly possible. (F)lex does not allow negative lookaheads in regular expressions, but you can use the pattern matching precedence rule to capture erroneous tokens.
For example, you could do something like this:
dataType int|float|char|String
id [[:alpha:]_][[:alnum:]_]*
%%
{dataType}[[:blank:]]+{dataType} { puts("Error: two types"); }
{dataType}[[:blank:]]+{id} { puts("Valid declaration"); }
/* ... more rules ... */
The above uses Posix character classes instead of writing out the possible characters. See man isalpha for a list of Posix character classes; the character class component [:xxxxx:] contains exactly the characters accepted by the isxxxxx standard library function. I fixed the pattern so that it allows more than one space between the dataType and the id, and simplified the pattern for ids.

How to make flex try the second longest matching regular expression?

This question might sound a little confusing. I'm using Flex to pass tokens to Bison.
The behavior I want is that Flex matches the longest regular expression and passes that token (it DOES work like this), but if that token doesn't work with the grammar, it then matches the second longest regular expression and passes that token.
I'm struggling to think of a way to create this behavior. How could I make this happen?
To clarify, for example, say I have two rules:
"//" return TOKEN_1;
"///" return TOKEN_2;
Given the string "///", I'd like it to first pass TOKEN_2 (it does).
If TOKEN_2 doesn't fit with the grammar as specified in Bison, it then passes TOKEN_1 (which is also valid).
How can I create this behavior?
In flex, you can have a rule that tries to do something but fails and tries the second-best rule by using the REJECT macro:
REJECT directs the scanner to proceed on to the "second best" rule which matched the input (or a prefix of the input). The rule is chosen as described above in "How the Input is Matched", and yytext and yyleng set up appropriately. It may either be one which matched as much text as the originally chosen rule but came later in the flex input file, or one which matched less text.
(source: The Flex Manual Page).
So to answer your question about getting the second-longest expression, you might be able to do this using REJECT (though you have to be careful, because it could just pick something of the same length with equal priority).
Note that flex will run slower with REJECT being used because it needs to maintain extra logic to "fall back" to worse matches at any point. I'd suggest only using this if there's no other way to fix your problem.
Sorry, but you can't do that. I'm actually unsure how much flex talks to bison. I do know there is a mode for REPL parsing, and I do know there is another mode that parses it all.
You'll have to inline the rule. For example, instead of // and / you write a rule which accepts ///, then another that assumes /// means // /. But that gets messy, and I only did that in a specific case in my code.
I would just have the lexer scan two tokens, // and /, and then have the parser deal with the cases when they are supposed to be regarded as one token or as separate tokens. I.e. a grammar rule that begins with /// can actually be refactored into one which starts with // and /. In other words, do not have a TOKEN_2 at all. In some cases this sort of thing can be tricky though, because the LALR(1) parser has only one token of lookahead. It has to make a shift or reduce decision based on seeing the // only, without regard for the / which follows.
I had an idea for solving this with a hacky approach involving a lexical tie in, but it proved unworkable.
The main flaw with the idea is that there isn't any easy way to do error recovery in yacc which is hidden from the user. If a syntax error is triggered, that is visible. The yyerror function could contain a hack to try to hide this, but it lacks the context information.
In other words, you can't really use Yacc error actions to trigger a backtracking search for another parse.
This is tough for bison/yacc to deal with, as it doesn't do backtracking. Even if you use a backtracking parser generator like btyacc, it doesn't really help unless it also backtracks through the lexer (which would likely require a parser generator with integrated lexer.)
My suggestion would be to have the lexer recognize a slash immediately followed by a slash specially and return a different token:
\//\/ return SLASH;
\/ return '/'; /* not needed if you have the catch-all rule: */
. return *yytext;
Now you need to 'assemble' multi-slash 'tokens' as non-terminals in the grammar.
single_slash: SLASH | '/' ;
double_slash: SLASH SLASH | SLASH '/' ;
triple_slash: SLASH SLASH SLASH | SLASH SLASH '/' ;
However, you'll now likely find you have conflicts in the grammar due to the 1-token lookahead not being enough. You may be able to resolve those by using btyacc or bison's %glr-parser option.

What exactly is a token, in relation to parsing?

I have to use a parser and writer in C++. I am trying to implement the functions; however, I do not understand what a token is. One of my functions/operations is to check to see if there are more tokens to produce:
bool Parser::hasMoreTokens()
How exactly do I go about this? Please help.
So:
I am opening a text file with text in it; all words are lowercased. How do I go about checking to see if it hasMoreTokens?
This is what I have:
bool Parser::hasMoreTokens() {
    while (source.peek() != NULL) {
        return true;
    }
    return false;
}
Tokens are the output of lexical analysis and the input to parsing. Typically they are things like
numbers
variable names
parentheses
arithmetic operators
statement terminators
That is, roughly, the biggest things that can be unambiguously identified by code that just looks at its input one character at a time.
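As a rough sketch of what that means in code (the names here are illustrative, not a standard API), a token is usually just a tag saying which kind of thing was matched, plus the matched text:
#include <string>

// One common representation: the kind of token plus the matched text (lexeme).
enum class TokenKind { Number, Identifier, LeftParen, RightParen, Operator, Semicolon, EndOfInput };

struct Token {
    TokenKind kind;
    std::string text;   // e.g. "123", "foo", "(", "+", ";"
};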
One note, which you should feel free to ignore if it confuses you: The boundary between lexical analysis and parsing is a little fuzzy. For instance:
Some programming languages have complex-number literals that look, say, like 2+3i or 3.2e8-17e6i. If you were parsing such a language, you could make the lexer gobble up a whole complex number and make it into a token; or you could have a simpler lexer and a more complicated parser, and make (say) 3.2e8, -, 17e6i be separate tokens; it would then be the parser's job (or even the code generator's) to notice that what it's got is really a single literal.
In some programming languages, the lexer may not be able to tell whether a given token is a variable name or a type name. (This happens in C, for instance.) But the grammar of the language may distinguish between the two, so that you'd like "variable foo" and "type name foo" to be different tokens. (This also happens in C.) In this case, it may be necessary for some information to be fed back from the parser to the lexer so that it can produce the right sort of token in each case.
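A tiny C/C++ illustration of that point (this snippet is mine, not from the question):
int main() {
    typedef int T;   // with this typedef in scope, the next line is a declaration
    T * x;           // declares x as a pointer to int
    (void)x;
    // If T and x were instead variables, e.g. "int T = 2, x = 3;", then the
    // same character sequence "T * x;" would be an expression statement:
    // T multiplied by x. The lexer alone cannot tell the two apart.
    return 0;
}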
So "what exactly is a token?" may not always have a perfectly well defined answer.
A token is whatever you want it to be. Traditionally (and for good reasons), language specifications broke the analysis into two parts: the first part broke the input stream into tokens, and the second parsed the tokens. (Theoretically, I think you can write any grammar in only a single level, without using tokens, or, what amounts to the same thing, using individual characters as tokens. I wouldn't like to see the results of that for a language like C++, however.)
But the definition of what a token is depends entirely on the language you are parsing: most languages, for example, treat white space as a separator (but not Fortran); most languages will predefine a set of punctuation/operators using punctuation characters, and not allow these characters in symbols (but not COBOL, where "abc-def" would be a single symbol). In some cases (including in the C++ preprocessor), what is a token depends on context, so you may need some feedback from the parser. (Hopefully not; that sort of thing is for very experienced programmers.)
One thing is probably sure (unless each character is a token): you'll have to read ahead in the stream. You typically can't tell whether there are more tokens by just looking at a single character. I've generally found it useful, in fact, for the tokenizer to read a whole token at a time, and keep it until the parser needs it. A function like hasMoreTokens would in fact scan a complete token.
(And while I'm at it, if source is an istream: istream::peek does not return a pointer, but an int.)
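A minimal sketch of those two points, assuming source is a std::istream (the surrounding class shape is invented for illustration):
#include <cctype>
#include <istream>

class Parser {
public:
    explicit Parser(std::istream& in) : source(in) {}

    // istream::peek() returns an int: the next character, or EOF at the end
    // of input, so compare against EOF rather than NULL.
    bool hasMoreTokens() {
        while (source.peek() != std::istream::traits_type::eof()
               && std::isspace(source.peek()))
            source.get();                    // skip whitespace between tokens
        return source.peek() != std::istream::traits_type::eof();
    }

private:
    std::istream& source;
};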
A token is the smallest unit of a programming language that has a meaning. A parenthesis (, a name foo, an integer 123, are all tokens. Reducing a text to a series of tokens is generally the first step of parsing it.
A token is usually akin to a word in spoken language. In C++, int, float, 5.523 and const will be tokens. It is the minimal unit of text which constitutes a semantic element.
When you split a large unit (a long string) into a group of sub-units (smaller strings), each of those sub-units is referred to as a "token". If there are no more sub-units, then you are done parsing.
How do I tokenize a string in C++?
A token is a terminal in a grammar: a sequence of one or more symbols that is defined by the sequence itself, i.e. it does not derive from any other production defined in the grammar.

Most efficient method to parse small, specific arguments

I have a command line application that needs to support arguments of the following form:
all: return everything
search: return the first match to search
all*search: return everything matching search
X*search: return the first X matches to search
search#Y: return the Yth match to search
Where search can be either a single keyword or a space separated list of keywords, delimited by single quotes. Keywords are a sequence of one or more letters and digits - nothing else.
A few examples might be:
2*foo
bar#8
all*'foo bar'
This sounds just complex enough that flex/bison come to mind - but the application can expect to have to parse strings like this very frequently, and I feel like (because there's no counting involved) a fully-fledged parser would incur entirely too much overhead.
What would you recommend? A long series of string ops? A few beefy subpattern-capturing regular expressions? Is there actually a plausible argument for a "real" parser?
It might be useful to note that the syntax for this pseudo-grammar is not subject to change, so if the code turns out less-than-wonderfully-maintainable, I won't cry. This is all in C++, if that makes a difference.
Thanks!
I wouldn't recommend a full lex/yacc parser just for this. What you described can fit a simple regular expression:
((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?
If you have a regex engine that supports captures, it's easy to extract the individual pieces of information you need. (Most probably in captures 1, 3 and 4.)
If I understood what you mean, you will probably want to check that capture 1 and capture 4 are not both non-empty at the same time.
If you need to further split the search terms, you could do it in a subsequent step, parsing capture 3.
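A sketch of how that could look with C++11 std::regex (the group numbering below is my reading of the pattern: group 2 holds the count or "all", group 3 the search term, group 4 the #Y suffix; adjust as needed):
#include <iostream>
#include <regex>
#include <string>

int main() {
    const std::regex arg_re(
        R"re(((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?)re");
    for (std::string arg : {"2*foo", "bar#8", "all*'foo bar'", "baz"}) {
        std::smatch m;
        if (std::regex_match(arg, m, arg_re))
            std::cout << arg << " -> count=[" << m[2] << "] search=[" << m[3]
                      << "] index=[" << m[4] << "]\n";
        else
            std::cout << arg << " -> no match\n";
    }
}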
Even without regex, I would hand-write a function. It would be simpler than dealing with lex/yacc, and I guess you could put together something that is even more efficient than a regular expression.
The answer mostly depends on a balance between how much coding you want to do and how many libraries you want to depend on - if your application can depend on other libraries, you can use any of the many regular expression libraries - e.g. POSIX regex, which comes with all Linux/Unix flavors.
OR
If you just want those specific syntaxes, I would use the string tokenizer (strtok) - split on '*' and split on '#' - then handle each case.
In this case the strtok approach would be much better, since the number of commands to be parsed is small.
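A rough sketch of that manual approach (using std::string::find rather than strtok, but the idea of splitting on '*' and '#' is the same; names are illustrative only):
#include <iostream>
#include <string>

int main() {
    std::string arg = "all*'foo bar'";
    std::string::size_type star = arg.find('*');
    std::string::size_type hash = arg.find('#');

    if (star != std::string::npos) {
        std::string count  = arg.substr(0, star);    // "all" or a number
        std::string search = arg.substr(star + 1);   // the search term
        std::cout << "count=" << count << " search=" << search << "\n";
    } else if (hash != std::string::npos) {
        std::string search = arg.substr(0, hash);
        std::string index  = arg.substr(hash + 1);   // the Yth match
        std::cout << "search=" << search << " index=" << index << "\n";
    } else {
        std::cout << "search=" << arg << "\n";       // plain search, or "all"
    }
}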