Comment pattern match in flex using states

Comment pattern match in flex using states - regex

I am trying to match single line comment pattern in flex. Patterns of the comment could be:
//this is a single /(some random stuff) line comment
Or it could be like this:
// this is also a comment\
continuation of the comment from previous line
From the example it's obvious that I have to handle the multi-line case too.
Now my approach was using states. This is what I have so far:
"//" {
yymore();
BEGIN (SINGLE_COMMENT);
}
<SINGLE_COMMENT>([^{NEWLINE}]|\\[(.){NEWLINE}]) {
yymore();
}
<SINGLE_COMMENT>([^{NEWLINE}]|[^\\]{NEWLINE}) {
logout << "Line no " << line_count << ": TOKEN <COMMENT> Lexeme " << string(yytext) << "\nfound\n\n";
BEGIN (INITIAL);
}
NEWLINE is declared as:
NEWLINE \r?\n
My declaration unit:
%option noyywrap
%x SINGLE_COMMENT
int line_count = 1;
const int bucketSize = 10; // change if necessary
ofstream logout;
ofstream tokenout;
SymbolTable symbolTable(bucketSize);
Action of NEWLINE:
{NEWLINE} {
line_count++;
}
If I run it with the following input:
// hello\
int main
This is my log file:
Line no 1: TOKEN <COMMENT> Lexeme // hello\
found
Line no 1: TOKEN <INT> Lexeme int found
Line no 1: TOKEN <ID> Lexeme main found
ScopeTable # 1
6 --> < main , ID >
So, it's not catching the multi-line comment. Also the line_count is not incremented. It's staying the same. Can anybody help me figuring out what I have done wrong?
Link to code

In (f)lex, as in most regular expression engines, [ and ] enclose a character class description. A character class is a set of individual characters, and it always matches exactly one character which is a member of that set. There are also negated character classes which are written the same way except that they start with [^ and match exactly one character which is not a member of the set.
Character classes are not the same as sequences of characters:
ab matches an a followed by a b
[ab] matches either an a or a b
Since character classes are just sets of characters, it is meaningless for the individual characters in the class to be repeated or optional, etc. Consequently, almost no regular expression operators (*, +, ?, etc.) are meaningful inside a character class. If you put one of them in a character class expression, it is handled just like an ordinary character:
a* matches 0 or more as
[a*] matches either an a or a *
One of the features flex provides which is not provided by most other regular expression systems is macro expansions, of the form {name}. Here the { and } indicate the expansion of a defined macro, whose name is contained between the braces. These characters are also not special inside a character class:
{identifier} matches whatever the expanded macro named identifier would match.
[{identifier}] matches a single character which is {, } or one of the letters definrt
Macro definitions seem to be overused by beginners. My advice is always to avoid them, and thereby avoid the confusion which they create.
It's also worth noting that (f)lex does not have an operator which negates a subpattern. Only character classes can be negated; there is no easy way to write "match anything other than foo". However, you can generally rely on the first longest-match rule to effectively implement negations: if some pattern p executes, then there cannot be any pattern which would match more than p. Thus, it might not be necessary to explicitly write the negation.
For example, in your comment detector where the only real issue is dealing with carriage return (\r) characters which are not followed by newline characters, you could use (f)lex's pattern matching algorithm to your advantage:
<SINGLE_COMMENT>{
[^\\\r\n]+ ;
\\\r?\n { ++line_count; }
\\. ; /* only matches if the above rule doesn't */
\r?\n { ++line_count; BEGIN(INITIAL); }
\r ; /* only matches if the above rule doesn't */
}
By the way, it's usually much easier to provide %option yylineno than to try to track newlines manually.

Related

What happens with extra spaces and newlines in C/C++ code?

Is there a difference between;
int main(){
return 0;
}
and
int main(){return 0;}
and
int main(){
return
0;
}
They will all likely compile to same executable. How does the C/C++ compiler treat the extra spaces and newlines, and if there is a difference between how newlines are treated differently than spaces in C code?
Also, how about tabs? What's the significance of using tabs instead of spaces in code, if there is any?

Any sequence of 1+ whitespace symbol (space/line-break/tab/...) is equivalent to a single space.
Exceptions:
Whitespace is preserved in string literals. They can't contain line-breaks, except C++ raw literals (R"(...)"). The same applies to file names in #include.
Single-line comments (//) are terminated with line-breaks only.
Preprocessor directives (starting with #) are terminated with line-breaks only.
\ followed by a line-break removes both, allowing multi-line // comments, preprocessor directrives, and string literals.
Also, whitespace symbols are ignored if there is punctuation (anything except letters, numbers, and _) to the left and/or to the right of it. E.g. 1 + 2 and 1+2 are the same, but return a; and returna; are not.
Exceptions:
Whitespace is not ignored inside string literals, obviously. Nor in #include file names.
Operators consisting of >1 punctuation symbols can't be separated, e.g. cout < < 1 is illegal. The same applies to things like // and /* */.
A space between punctuation might be necessary to prevent it from coalescing into a single operator. Examples:
+ +a is different from ++a.
a+++b is equivalent to a++ +b, but not to a+ ++b.
Pre-C++11, closing two template argument lists in a row required a space: std::vector<std::vector<int> >.
When defining a function-like macro, the space is not allowed before the opening parenthesis (adding it turns it into an object-like macro). E.g. #define A() replaces A() with nothing, but #define A () replaces A with ().

recursive matching for string delimiter with regular expression

In verilog language, the statements are enclosed in a begin-end delimiter instead of bracket.
always# (*) begin
if (condA) begin
a = c
end
else begin
b = d
end
end
I'd like to parse outermost begin-end with its statements to check coding rule in python. Using regular expression, I want results with regular expression like:
if (condA) begin
a = c
end
else begin
b = d
end
I found similar answer for bracket delimiter.
int funcA() {
if (condA) {
b = a
}
}
regular expression:
/({(?>[^{}]+|(?R))*})/g
However, I don't know how to modify atomic group ([^{}]) for "begin-end"?
/(begin(?>[??????]+|(?R))*end)/g

The point of the [??????]+ part is to match any text that does not match a char that is equal or is the starting point of the delimiters.
So, in your case, you need to match any char other than a char that starts either begin or end substring:
/begin(?>(?!begin|end).|(?R))*end/gs
See the regex demo
The . here will match any char including line break chars due to the s modifier. Note that the actual implementation might need adjustments (e.g. in PHP, the g modifier should not be used as there are specific functions/features for that).
Also, since you recurse the whole pattern, you need no outer parentheses.

Why is the flex regex being skipped?

I can't, for the life of me, figure out what's wrong with my regex's.
What I'd like to tokenize are two (2) types of strings, both of which to be contained on a single line. One string can be anything (other than a new line), and the other, any alpha-numeric (ASCII) character and literal '_', '/' '-', and '.'.
The snippet of flex code is:
nl \n|\r\n|\r|\f|\n\r
...
%%
...
\"[^\"]+{nl} { frx_parser_error("Label is missing trailing double quote."); }
\"[a-zA-Z0-9_\.\/\-]+\" {
if (yyleng > 1024) frx_parser_error("File name too long.");
yytext[yyleng - 1] = '\0';
frx_parser_lval.str = strdup(yytext+1);
fprintf(stderr,"TOSP_FILENAME: %s\n", frx_parser_lval.str);
return (TOSP_FILENAME);
}
\"[^{nl}]+\" {
yytext[yyleng - 1] = '\0';
frx_parser_lval.str = strdup(yytext+1);
fprintf(stderr,"TOSP_IDENTIFIER:\n%s\n", frx_parser_lval.str);
return (TOSP_IDENTIFIER);
}
And when I run the parser, the fprintf's spit this out:
TOSP_FILENAME: ModStar-Picture-Analysis.txt
TOSP_FILENAME: ModStar-Rubric.log.txt
TOSP_IDENTIFIER:
picture-A"
Progress (26,255) camera 'C' root("picture-C-
Syntax (line 34): syntax error
For whatever reason, the quote after picture-A is being ... missed. Why? I checked the ASCII values for the eight locations the quote character appears and they're all 0x22 (where the double quutoes appear that is).
If I add some characters to the end of the "picture-A" it can work sometimes; adding ".par", ".pbr" doesn't work as expected, but ".pnr" does.
I've even added a specific non-regexy token:
\"picture-A\" { frx_parser_lval.str = strdup("picture-A"); return TOSP_FILENAME; }
to the lex file and it gets skipped.
I'm using flex 2.5.39, no flex libraries, one option (%option prefix=frx_parser_) in the lex file and the flex command line is:
flex -t script-lexer.l > script-lexer.c
What gives?
EDIT I need to test this on the actual system, but unit tests show this tokenizer to be much more robust (based on rici's answer):
nl \n|\r\n|\r|\f|\n\r
...
%%
...
["][^"]+{nl} { printf("Missing trailing quote.\n%s\n",yytext); }
["][[:alnum:]_./-]+["] { printf("File name:\n%s\n",yytext); }
["][^"]+["] { printf("String:\n%s\n",yytext); }
EDIT The rule ["].+["] swallows consecutive multiple strings as one big string. It was changed to ["][^"]+["]

The problem is your pattern:
\"[^{nl}]+\"
You're attempting to expand a definition inside a character class, but that is not possible; inside a character class, { is always just a {, not a flex operator. See the flex manual:
Note that inside of a character class, all regular expression operators lose their special meaning except escape (‘\’) and the character class operators, ‘-’, ‘]]’, and, at the beginning of the class, ‘^’.
A definition is not a macro. Rather, a definition defines a new regular expression operator.
As a consequence of the above, you can write [^\"] as simply [^"] and \"[a-zA-Z0-9_\.\/\-]+\" as \"[a-zA-Z0-9_./-]+\" (The - needs to be either at the end or at the beginning.) Personally, I'd write the second pattern as:
["][[:alnum:]_./-]+["]
But everyone has their own style.

C++ TR1 regex - multiline option

I thought that $ indicates the end of string. However, the following piece of code gives "testbbbccc" as a result, which is quite astonishing to me... This means that $ actually matches end of line, not end of the whole string.
#include <iostream>
#include <regex>
using namespace std;
int main()
{
tr1::regex r("aaa([^]*?)(ogr|$)");
string test("bbbaaatestbbbccc\nddd");
vector<int> captures;
captures.push_back(1);
const std::tr1::sregex_token_iterator end;
for (std::tr1::sregex_token_iterator iter(test.begin(), test.end(), r, captures); iter != end; )
{
string& t1 = iter->str();
iter++;
cout << t1;
}
}
I have been trying to find a "multiline" switch (which actually can be easily found in PCRE), but without success... Can someone point me to the right direction?
Regards,
R.P.

As Boost::Regex was selected for tr1, try the following:
From Boost::Regex
Anchors:
A '^' character shall match the start
of a line when used as the first
character of an expression, or the
first character of a sub-expression.
A '$' character shall match the end of
a line when used as the last character
of an expression, or the last
character of a sub-expression.
So the behavior you observed is correct.
From: Boost Regex as well:
\A Matches at the start of a buffer
only (the same as \`).
\z Matches at
the end of a buffer only (the same as
\').
\Z Matches an optional sequence
of newlines at the end of a buffer:
equivalent to the regular expression
\n*\z
I hope that helps.

There is no multiline switch in TR1 regexs. It's not exactly the same, but you could get the same functionality matching everything:
(.|\r|\n)*?
This matches non-greedily every character, including new line and carriage return.
Note: Remember to escape the backslashes '\' like this '\\' if your pattern is a C++ string in code.
Note 2: If you don't want to capture the matched contents, append '?:' to the opening bracket:
(?:.|\r|\n)*?

Regex to read a class file

I am looking for a Regex to read a c# class file like
MyClass{}
The regex shld return "MyClass" when passed the string "MyClass{}".
Edited:
Common
{
MyClass1
{
Method1
{
"Helloworld";
"GoodBye";
}
Method2
{
"SayGoodMorning";
}
}
MyClass2
{
Method3
{
"M3";
}
}
}
Actually i have to read a hierarchy like above, there can b n number of it and have to read before and inside {}

If your regex engine supports lookahead, then you can use
\b[^{\s]+(?=\s*\{)
This will match ["Common", "MyClass1", "Method1", "Method2", "MyClass2", "Method3"] in your example.
Explanation:
\b: Start the match at a word boundary.
[^{\s]+: Match one or more characters except opening braces or whitespace.
(?=\s*\{): Assert that the match ends with a character that is followed by optional whitespace (including linebreaks) and an opening brace.
Restricting matches to certain hierarchy levels (in this case, only match at the second level of nesting) is not possible with regular expressions in general. It may be possible in certain regex dialects, but this is stretching the limits of what regexes are designed for - a parser would suit this better.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Comment pattern match in flex using states - regex

Related

What happens with extra spaces and newlines in C/C++ code?

recursive matching for string delimiter with regular expression

Why is the flex regex being skipped?

C++ TR1 regex - multiline option

Regex to read a class file

Categories

Resources