regex (in flex) for complete general sentence - regex

I am defining tokens inside flex as:
%%
#[^\\\" \n\(\),=\{\}#~]+ {yylval.sval = strdup(yytext + 1); return ENTRYTYPE;}
[A-Za-z][A-Za-z0-9:"]* { yylval.sval = strdup(yytext); return KEY; }
\"([^"]|\\.)*\"|\{([^"]|\\.)*\} { yylval.sval = strdup(yytext); return VALUE; }
[ \t\n] ; /* ignore whitespace */
[{}=,] { return *yytext; }
. { fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }
%%
(Though this is probably not a good way to do it.)
The problem is that the VALUE rule works fine for a quoted string of the form "some quote", but not when the value is enclosed in braces (of the form {some sentences}).
What is going wrong there?

I think that you want this, instead:
\"([^"]|\\.)*\"|\{([^\}]|\\.)*\} { yylval.sval = strdup(yytext); return VALUE; }
Even better, the following will be clearer and easier to maintain:
\"([^"]|\\.)*\" { yylval.sval = strdup(yytext); return VALUE; }
\{([^\}]|\\.)*\} { yylval.sval = strdup(yytext); return VALUE; }
Update
I have escaped the right brace in the character class expressions.


Retrieve each token from a file according to specific criteria

I'm trying to create a lexer for a functional language. One of its methods should, on each call, return the next token of a file.
For example:
func main() {
    var MyVar : integer = 3+2;
}
So I would like each call to the next method to return the next token in that sequence; in this case, it would look like this:
func
main
(
)
{
var
MyVar
:
integer
=
3
+
2
;
}
Except that the result I get is not what I expected:
func
main(
)
{
var
MyVar
:
integer
=
3+
2
}
Here is my method:
token_t Lexer::next() {
    token_t ret;
    std::string token_tmp;
    bool IsSimpleQuote = false; // check char --> '...'
    bool IsDoubleQuote = false; // check string --> "..."
    bool IsComment = false;     // check comments --> `...`
    bool IterWhile = true;
    while (IterWhile) {
        bool IsInStc = (IsDoubleQuote || IsSimpleQuote || IsComment);
        std::ifstream file_tmp(this->CurrentFilename);
        if (this->eof) break;
        char chr = this->File.get();
        char next = file_tmp.seekg(this->CurrentCharIndex + 1).get();
        ++this->CurrentCharInCurrentLineIndex;
        ++this->CurrentCharIndex;
        {
            if (!IsInStc && !IsComment && chr == '`') IsComment = true;
            else if (!IsInStc && IsComment && chr == '`') { IsComment = false; continue; }
            if (IsComment) continue;
            if (!IsInStc && chr == '"') IsDoubleQuote = true;
            else if (!IsInStc && chr == '\'') IsSimpleQuote = true;
            else if (IsDoubleQuote && chr == '"') IsDoubleQuote = false;
            else if (IsSimpleQuote && chr == '\'') IsSimpleQuote = false;
        }
        if (chr == '\n') {
            ++this->CurrentLineIndex;
            this->CurrentCharInCurrentLineIndex = -1;
        }
        token_tmp += chr;
        if (!IsInStc && IsLangDelim(chr)) IterWhile = false;
    }
    if (token_tmp.size() > 1 && System::Text::EndsWith(token_tmp, ";") || System::Text::EndsWith(token_tmp, " ")) token_tmp.pop_back();
    ++this->NbrOfTokens;
    location_t pos;
    pos.char_pos = this->CurrentCharInCurrentLineIndex;
    pos.filename = this->CurrentFilename;
    pos.line = this->CurrentLineIndex;
    SetToken_t(&ret, token_tmp, TokenList::ToToken(token_tmp), pos);
    return ret;
}
Here is the function IsLangDelim :
bool IsLangDelim(char chr) {
    return (chr == ' ' || chr == '\t' || TokenList::IsSymbol(CharToString(chr)));
}
TokenList is a namespace that contains the list of tokens, as well as some functions (like IsSymbol in this case).
I have already tried other versions of this method, but the result is almost always the same.
Do you have any idea how to improve this method?
The solution to your problem is to use std::regex. The syntax is a little difficult to understand at first, but once you understand it, you will use it all the time.
And it is designed for finding tokens.
The specific criteria can be expressed in the regex string.
For your case I would use: std::regex re(R"#((\w+|\d+|[;:\(\)\{\}\+\-\*\/\%\=]))#");
This means:
look for one or more word characters (that is an identifier),
or look for one or more digits (that is an integer number),
or look for any of the meaningful operators (like '+', '-', '{' and so on).
You can extend the regex for all the other things you are searching for. You can also run a regex over a regex result.
Please see the example below. It will produce your expected output from your provided input.
And your described task is only one statement in main.
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>
#include <regex>

// Our test data (raw string).
std::string testData(
R"#(func main() {
var MyVar : integer = 3+2;
}
)#");

std::regex re(R"#((\w+|\d+|[;:\(\)\{\}\+\-\*\/\%\=]))#");

int main(void)
{
    std::copy(
        std::sregex_token_iterator(testData.begin(), testData.end(), re, 1),
        std::sregex_token_iterator(),
        std::ostream_iterator<std::string>(std::cout, "\n")
    );
    return 0;
}
You try to parse using a single loop, which makes the code very complicated. Instead, I suggest something like this:
struct token { ... };

struct lexer {
    vector<token> tokens;
    string source;
    unsigned int pos;

    bool parse_ident() {
        if (!is_alpha(source[pos])) return false;
        auto start = pos;
        while (pos < source.size() && is_alnum(source[pos])) ++pos;
        tokens.push_back({ token_type::ident, source.substr(start, pos - start) });
        return true;
    }

    bool parse_num() { ... }
    bool parse_comment() { ... }
    ...
    bool parse_whitespace() { ... }

    void parse() {
        while (pos < source.size()) {
            if (!parse_comment() && !parse_ident() && !parse_num() && ... && !parse_whitespace()) {
                throw error{ "unexpected character at position " + std::to_string(pos) };
            }
        }
    }
};
This is the standard structure I use when lexing files in any scripting language I've written. Lexing is usually greedy, so you don't need to bother with regex (which is effective, but slower, unless you use some crazy template-based implementation). Just define your parse_* functions, make sure they return false if they didn't parse a token, and make sure they are called in the correct order.
The order itself usually doesn't matter, but:
operators need to be checked from longest to shortest
a number in the style .123 might be incorrectly recognized as the . operator (so you need to make sure there is no digit after the .)
numbers and identifiers look very much alike, except that identifiers start with a non-digit

Regular expression for string excluding literal quotation marks

I have the following config file that I am trying to parse.
[ main ]
e_type=0x1B
username="username"
appname="applicationname"
In the lex file (test.l) below, the regular expression for STR is \"[^\"]*\" so that it recognizes everything within quotes. When I access the value of "username" or "applicationname" inside the parser file using the $N variable, it contains the literal string. I just want username and applicationname, i.e. without the quotation marks.
Is there a standard way to achieve this?
I have the following lex file (test.l):
%option noyywrap
%option yylineno
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "y.tab.h"
int yylinenu = 1;
int yycolno=1;
/**
* Forward declarations
**/
void Number ();
void HexaNumber ();
unsigned char getHexaLex (char c);
unsigned int strtol16 (char * str);
%}
%option nounput
%option noinput
%option case-insensitive
/*-----------------------------------------------------------------
Some macros (standard regular expressions)
------------------------------------------------------------------*/
DIGIT [0-9]
HEXALETTER [a-fA-F]
HEXANUMBER [0][x](({DIGIT}|{HEXALETTER})+)
NUM {DIGIT}+
HEXA ({DIGIT}|{HEXALETTER}|[*])
STR \"[^\"]*\"
WSPACE [ \t]*
NEWLINE [\n\r]
/*----------------------------------------------------------------
The lexer rules
------------------------------------------------------------------*/
%%
e_type { yylval.str = yytext; return T_E_TYPE; }
main { yylval.str = yytext; return T_MAIN_SECTION;}
{HEXANUMBER} { yylval.n = atoi(yytext); HexaNumber(); return T_NUMBER; }
= { return T_EQUAL; }
"[" { return T_OPEN_BRACKET; }
"]" { return T_CLOSE_BRACKET;}
appname { Custom_tag(); return T_APPNAME; }
username { Custom_tag(); return T_APPNAME; }
[^\t\n\r] { }
{WSPACE} { } /* whitespace: (do nothing) */
{NEWLINE} {
yylinenu++;
return T_EOL;
}
{STR} { Generic_string(); return T_STRING;}
%%
void Number () {
    yylval.n = atol(yytext);
}
void Generic_string() {
    yylval.str = malloc(strlen(yytext)+1);
    strcpy (yylval.str, yytext);
}
You have a pointer to the matched token (yytext) and its length (yyleng), so it is quite simple to remove the quotes:
void Generic_string() {
    yylval.str = malloc(yyleng - 1);             // length - 2 (quotes) + 1 (NUL)
    memcpy (yylval.str, yytext + 1, yyleng - 2); // copy all but quotes
    yylval.str[yyleng - 2] = 0;                  // NUL-terminate
}
Personally, I'd suggest avoiding use of global variables in Generic_string, both to simplify future implementation of a reentrant scanner, and to make the process a bit more flexible:
{STR} { yylval.str = duplicate_segment(yytext + 1, yyleng - 2);
return T_STRING;
}
/* ... */
char* duplicate_segment(const char* token, int token_length) {
    char* dup = malloc(token_length + 1);
    if (!dup) { /* handle memory allocation error */ }
    memcpy(dup, token, token_length);
    dup[token_length] = 0;
    return dup;
}

Why does this scanner not eat whitespaces?

These are my lexer definitions; there are many lexer definitions, but these ones are mine. I have several regexes trying to capture and ignore whitespace from this sample. The error I get is that at line 1:14 there is a $undefined symbol, one with ASCII value 32.
Also known as space.
OIL_VERSION = "314159OS";
CPU AT91SAM7S256
{
//Test Coment
OS HTOSEK
{
STATUS = EXTENDED;
STARTUPHOOK = TRUE;
ERRORHOOK = FALSE;
SHUTDOWNHOOK = FALSE;
PRETASKHOOK = FALSE;
POSTTASKHOOK = FALSE;
USEGETSERVICEID = FALSE;
USEPARAMETERACCESS = FALSE;
USERESSCHEDULER = FALSE;
USR_STACK_SIZE=3000;
};
/* Definition of application mode */
APPMODE appmode1{};
/* Definition of resource */
RESOURCE resource1
{
RESOURCEPROPERTY = STANDARD;
};
/* Definition of event */
EVENT event1
{
MASK = AUTO;
};
...
Lex Capture Definitions:
%{ /*** C/C++ Declarations ***/
#define MAX_INCLUDE_DEPTH 16
#include <string>
#include <sstream>
#define SSTR( x ) dynamic_cast< std::ostringstream & >( \
( std::ostringstream() << std::dec << x ) ).str()
#include "scanner.h"
/* import the parser's token type into a local typedef */
typedef implementation::Parser::token token;
typedef implementation::Parser::token_type token_type;
/* By default yylex returns int, we use token_type. Unfortunately yyterminate
* by default returns 0, which is not of token_type. */
#define yyterminate() return token::END
/* This disables inclusion of unistd.h, which is not available under Visual C++
* on Win32. The C++ scanner uses STL streams instead. */
#define YY_NO_UNISTD_H
static int once = 0;
static int lineno = 1;
static void nextLine()
{
lineno++;
}
//convert a str to int
int fromInt(char *s)
{
int i;
int m;
m = 1;
i = 0;
if (s[0]=='-'){
m = -1;
i = 1;
}
else if(s[0]=='+')
i = 1;
return((atoi(s+i))*m);
}
int fromHex(char *s)
{
return((int)strtol(s, NULL, 16));
}
int LineCounter=0;
%}
/*** Flex Declarations and Options ***/
/* enable c++ scanner class generation */
%option c++
/* change the name of the scanner class. results in "ExampleFlexLexer" */
%option prefix="Example"
/* the manual says "somewhat more optimized" */
%option batch
/* enable scanner to generate debug output. disable this for release
* versions. */
%option debug
/* no support for include files is planned */
%option yywrap nounput
/* enables the use of start condition stacks */
%option stack
%x C_COMMENT
%x incl
/* The following paragraph suffices to track locations accurately. Each time
* yylex is invoked, the begin position is moved onto the end position. */
%{
#define YY_USER_ACTION yylloc->columns(yyleng);
%}
%% /*** Regular Expressions Part ***/
/* code to place at the beginning of yylex() */
%{
// reset location
yylloc->step();
%}
"/*" { BEGIN(C_COMMENT); }
<C_COMMENT>"*/" { BEGIN(INITIAL); }
<C_COMMENT>. { }
"=" { return(token::EQ);
}
"[" { return(token::LBRACK);
}
"]" { return(token::RBRACK);
}
"OS" { return(token::OSEK);
}
"EVENT" { return(token::EVENT);
}
"TASK" { return(token::TASK);
}
"ALARM" { return(token::ALARM);
}
"COUNTER" { return(token::COUNTER);
}
"OIL_VERSION" { return(token::OIL_VERSION);
}
"APPMODE" { return(token::APPMODE);
}
"CPU" { return (token::CPU);
}
"true"|"TRUE" { yylval->integerVal =1; return(token::VAL_BOOL);
}
"false"|"FALSE" { yylval->integerVal =0; return(token::VAL_BOOL);
}
"BOOLEAN" { return(token::BOOLEAN);
}
"INT" { return(token::INT);
}
"{" { return(token::LBRACE);
}
"}" { return(token::RBRACE);
}
":" { return(token::COLON);
}
"," { return(token::COMMA);
}
";" { return(token::SEMI);
}
([_A-Za-z])([a-zA-Z0-9!^_])* {yylval->stringVal = new std::string(yytext, yyleng);
return(token::STRING);
}
(([+-])?([0-9])*) {yylval->integerVal = fromInt( yytext );
return(token::NUMERAL);
}
(("0x")([0-9ABCDEFabcdef])*) {yylval->integerVal = fromHex( yytext );
return(token::NUMERAL);
}
(([-+]?[1-9][0-9]+\.[0-9]*)|([-+]?[0-9]*\.[0-9]+)|([-+]?[1-9]+))([eE][-+]?[0-9]+)?(f)? { yylval->doubleVal=atof(yytext);
return (token::VAL_FLOAT);
}
[\n\r]+ {
//yylloc->lines(yyleng);
yylloc->step();
LineCounter++;
//return token::EOL;
}
[\r\n]+ {
//yylloc->lines(yyleng);
yylloc->step();LineCounter++;
//return token::EOL;
}
[\t\r]+ { /* gobble up white-spaces */ yylloc->step(); }
[\s]+ { yylloc->step(); }
\"([^\"])*\" {
yytext[yyleng-1]= 0;
yylval->stringVal = new std::string( yytext, yyleng);
return(token::STRING);
}
. {
unsigned int temp;
temp= (unsigned int)(*yytext);
std::stringstream str2;
str2<<temp;
std::cout<<"Unknown character"<<*yytext<<" as Asci-value : "<<str2.str()<<std::endl;
return static_cast<token_type>(*yytext);
}
%% /*** Additional Code ***/
namespace implementation {
Scanner::Scanner(std::istream* in,
std::ostream* out)
: ExampleFlexLexer(in, out)
{
}
Scanner::~Scanner()
{
}
void Scanner::set_debug(bool b)
{
yy_flex_debug = b;
}
}
/* This implementation of ExampleFlexLexer::yylex() is required to fill the
* vtable of the class ExampleFlexLexer. We define the scanner's main yylex
* function via YY_DECL to reside in the Scanner class instead. */
#ifdef yylex
#undef yylex
#endif
int ExampleFlexLexer::yylex()
{
std::cerr << "in ExampleFlexLexer::yylex() !" << std::endl;
return 0;
}
/* When the scanner receives an end-of-file indication from YY_INPUT, it then
* checks the yywrap() function. If yywrap() returns false (zero), then it is
* assumed that the function has gone ahead and set up `yyin' to point to
* another input file, and scanning continues. If it returns true (non-zero),
* then the scanner terminates, returning 0 to its caller. */
int ExampleFlexLexer::yywrap()
{
return 1;
}
I modified the last rule so it simply doesn't try to cast any unknown text, and instead prints out the ASCII symbols it captures, resulting in 32 47 47 32 (" // ").
I will try to print out the stream.
flex does not implement Perlisms such as \s. The only backslash escape sequences it recognizes are standard C escapes such as \n. If you want to recognize a space character, use " ".
By the way, [\n\r]+ and [\r\n]+ recognize exactly the same thing: one or more repetitions of a single character which is either a newline or a carriage return. So the second such rule will never match. I think flex will warn you about that.
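Concretely, the whitespace handling can be fixed by replacing the broken rules with plain flex patterns. A sketch, reusing the question's own action style (delete the duplicate [\r\n]+ rule and the [\s]+ rule entirely; inside a flex character class, \s has no special meaning and just matches the letter s):

```lex
[\n\r]+    { yylloc->step(); LineCounter++; }
[ \t]+     { /* gobble up spaces and tabs */ yylloc->step(); }
```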

Lex only detects symbols when there is whitespace between them

I want Lex, when given an input of "foo+1", to first return the identifier "foo", then the character '+', and then the integer 1. This works if I lex "foo + 1", but for some reason with the grammar I have, it doesn't work if I omit the spaces, and it skips over the '+', just returning "foo" and then 1. I can't figure out why. Is there anything here that seems problematic?
%{
#include "expression.h"
#include "picoScanner.h"
static int block_comment_num = 0;
static char to_char(char *str);
int yylineno = 0;
%}
%option nodefault yyclass="FlexScanner" noyywrap c++
%x LINE_COMMENT
%x BLOCK_COMMENT
%%
Any { return pico::BisonParser::token::ANY; }
Int { return pico::BisonParser::token::INT; }
Float { return pico::BisonParser::token::FLOAT; }
Char { return pico::BisonParser::token::CHAR; }
List { return pico::BisonParser::token::LIST; }
Array { return pico::BisonParser::token::ARRAY; }
Table { return pico::BisonParser::token::TABLE; }
alg { return pico::BisonParser::token::ALG; }
if { return pico::BisonParser::token::IF; }
then { return pico::BisonParser::token::THEN; }
else { return pico::BisonParser::token::ELSE; }
is { return pico::BisonParser::token::IS; }
or { return pico::BisonParser::token::OR; }
and { return pico::BisonParser::token::AND; }
not { return pico::BisonParser::token::NOT; }
when { return pico::BisonParser::token::WHEN; }
[A-Z][a-zA-Z0-9_]* { yylval->strval = new std::string(yytext);
return pico::BisonParser::token::TYPENAME; }
[a-z_][a-zA-Z0-9_]* { printf("saw '%s'\n", yytext); yylval->strval = new std::string(yytext);
return pico::BisonParser::token::ID; }
"==" { return pico::BisonParser::token::EQ; }
"<=" { return pico::BisonParser::token::LEQ; }
">=" { return pico::BisonParser::token::GEQ; }
"!=" { return pico::BisonParser::token::NEQ; }
"->" { return pico::BisonParser::token::RETURN; }
[\+\-\*/%] { return yytext[0]; }
[-+]?[0-9]+ { yylval->ival = atoi(yytext);
return pico::BisonParser::token::INT_LITERAL; }
([0-9]+|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?) { yylval->fval = atof(yytext);
return pico::BisonParser::token::FLOAT_LITERAL; }
\"(\\.|[^\\"])*\" { yylval->strval = new std::string(strndup(yytext+1, strlen(yytext) - 2));
return pico::BisonParser::token::STRING_LITERAL; }
\'(\\.|[^\\'])*\' { yylval->cval = to_char(yytext+1);
return pico::BisonParser::token::CHAR_LITERAL; }
[ \t\r]+ { /* ignore */ }
\n { yylineno++; }
. { printf("~~~~~~~~~~munched %s\n", yytext); return yytext[0]; }
%%
static char to_char(char *str) {
if (strlen(str) <= 1) {
fprintf(stderr, "Error: empty character constant (line %d)\n", yylineno);
exit(1);
} else if (str[0] != '\\') {
return str[0];
} else {
if (strlen(str) == 1) {
fprintf(stderr, "Error: somehow we got a single slash character\n");
exit(1);
}
switch (str[1]) {
case 'n': return '\n';
case 'r': return '\r';
case 't': return '\t';
case 'a': return '\a';
case 'b': return '\b';
case 'f': return '\f';
case 'v': return '\v';
case '\'': return '\'';
case '"': return '"';
case '\\': return '\\';
case '?': return '\?';
case 'x':
fprintf(stderr, "Error: unicode not yet supported (line %d)\n", yylineno);
exit(1);
default:
fprintf(stderr, "Error: unrecognized escape sequence '\\%c' (line %d)\n",
str[1], yylineno);
exit(1);
}
}
}
I am not familiar with lex, but I'm pretty sure the following causes the error:
[-+]?[0-9]+ { yylval->ival = atoi(yytext);
return pico::BisonParser::token::INT_LITERAL; }
foo is parsed as an identifier, but then "+1" is parsed as an int literal (the atoi conversion absorbs the sign).
It is probably a good idea to only consider unsigned numeric literals at the lexer level and to handle signs at the parser level (treating the + and - tokens differently depending on their context).
Not only does this resolve the ambiguity, it also enables you to "correctly" (in the sense that these are legal in C, C++, Java, etc.) parse integer literals such as - 5 as well as -5.
Moreover: are the escaping backslashes in the arithmetic operator rule really necessary? AFAIK, the only characters with special meaning inside a character class are -, ^ and ] (but I might be wrong).
Looks to me like it's matching foo+1 as foo and +1 (an INT_LITERAL). See related thread: Is it possible to set priorities for rules to avoid the "longest-earliest" matching pattern?
You could add an explicit rule to match + as a token, otherwise it sounds like Lex is going to take the longest match it can (+1 is longer than +).

Flex lexer output modification

How can I use the flex lexer in C++ and modify a token's yytext value?
Let's say I have a rule like this:
"/*" {
char c;
while(true)
{
c = yyinput();
if(c == '\n')
++mylineno;
if (c==EOF){
yyerror( "EOF occured while processing comment" );
break;
}
else if(c == '*')
{
if((c = yyinput()) == '/'){
return(tokens::COMMENT);}
else
unput(c);
}
}
}
And I want the token tokens::COMMENT to carry the text of the comment between /* and */.
(The above solution gives "/*" as the value.)
Additionally, tracking the line number is very important, so I'm looking for a solution that supports it.
EDIT
Of course I can modify the yytext and yyleng values (like yytext += 1; yyleng -= 1), but that still doesn't solve the above problem.
I still think start conditions are the right answer.
%x C_COMMENT
char *str = NULL;

void addToString(char *data)
{
    if(!str)
    {
        str = strdup(data);
    }
    else
    {
        /* handle string concatenation */
    }
}
"/*" { BEGIN(C_COMMENT); }
<C_COMMENT>([^*\n\r]|(\*+([^*/\n\r])))* { addToString(yytext); }
<C_COMMENT>[\n\r] { /* handle tracking, add to string if desired */ }
<C_COMMENT>"*/" { BEGIN(INITIAL); }
I used the following as references:
http://ostermiller.org/findcomment.html
https://stackoverflow.com/a/2130124/1003855
You should be able to use a similar regular expression to handle strings.