Why is the flex regex being skipped?

I can't, for the life of me, figure out what's wrong with my regexes.
I'd like to tokenize two (2) types of strings, both of which must be contained on a single line. One string can be anything (other than a newline); the other may contain only alphanumeric (ASCII) characters and the literals '_', '/', '-', and '.'.
The snippet of flex code is:
nl \n|\r\n|\r|\f|\n\r
...
%%
...
\"[^\"]+{nl} { frx_parser_error("Label is missing trailing double quote."); }
\"[a-zA-Z0-9_\.\/\-]+\" {
if (yyleng > 1024) frx_parser_error("File name too long.");
yytext[yyleng - 1] = '\0';
frx_parser_lval.str = strdup(yytext+1);
fprintf(stderr,"TOSP_FILENAME: %s\n", frx_parser_lval.str);
return (TOSP_FILENAME);
}
\"[^{nl}]+\" {
yytext[yyleng - 1] = '\0';
frx_parser_lval.str = strdup(yytext+1);
fprintf(stderr,"TOSP_IDENTIFIER:\n%s\n", frx_parser_lval.str);
return (TOSP_IDENTIFIER);
}
And when I run the parser, the fprintf's spit this out:
TOSP_FILENAME: ModStar-Picture-Analysis.txt
TOSP_FILENAME: ModStar-Rubric.log.txt
TOSP_IDENTIFIER:
picture-A"
Progress (26,255) camera 'C' root("picture-C-
Syntax (line 34): syntax error
For whatever reason, the quote after picture-A is being ... missed. Why? I checked the ASCII values at the eight locations where the double quote character appears and they're all 0x22.
If I add some characters to the end of the "picture-A" it can work sometimes; adding ".par", ".pbr" doesn't work as expected, but ".pnr" does.
I've even added a specific non-regexy token:
\"picture-A\" { frx_parser_lval.str = strdup("picture-A"); return TOSP_FILENAME; }
to the lex file and it gets skipped.
I'm using flex 2.5.39, no flex libraries, one option (%option prefix=frx_parser_) in the lex file and the flex command line is:
flex -t script-lexer.l > script-lexer.c
What gives?
EDIT I need to test this on the actual system, but unit tests show this tokenizer to be much more robust (based on rici's answer):
nl \n|\r\n|\r|\f|\n\r
...
%%
...
["][^"]+{nl} { printf("Missing trailing quote.\n%s\n",yytext); }
["][[:alnum:]_./-]+["] { printf("File name:\n%s\n",yytext); }
["][^"]+["] { printf("String:\n%s\n",yytext); }
EDIT The rule ["].+["] swallows consecutive multiple strings as one big string. It was changed to ["][^"]+["]

The problem is your pattern:
\"[^{nl}]+\"
You're attempting to expand a definition inside a character class, but that is not possible; inside a character class, { is always just a {, not a flex operator. See the flex manual:
Note that inside of a character class, all regular expression operators lose their special meaning except escape (‘\’) and the character class operators, ‘-’, ‘]]’, and, at the beginning of the class, ‘^’.
A definition is not a macro. Rather, a definition defines a new regular expression operator.
As a consequence of the above, you can write [^\"] as simply [^"] and \"[a-zA-Z0-9_\.\/\-]+\" as \"[a-zA-Z0-9_./-]+\" (The - needs to be either at the end or at the beginning.) Personally, I'd write the second pattern as:
["][[:alnum:]_./-]+["]
But everyone has their own style.
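To see why the original third rule swallows the closing quote, here is a rough Python sketch. It is not flex: it relies on the fact that Python's re module also treats {, n, l and } inside brackets as literal characters, and longest_match is a hypothetical helper that only imitates flex's "longest match wins, first rule wins ties" behaviour. The input string is made up to resemble the question's data.

```python
import re

# The two quoted-string rules from the question, in their original order.
# Inside [...] the braces are literal, so [^{nl}] means
# "any character except {, n, l or }" (newlines and quotes included).
BROKEN_RULES = [
    ("TOSP_FILENAME", re.compile(r'"[a-zA-Z0-9_./-]+"')),
    ("TOSP_IDENTIFIER", re.compile(r'"[^{nl}]+"')),
]

# The corrected rules from the EDIT at the end of the question.
FIXED_RULES = [
    ("TOSP_FILENAME", re.compile(r'"[a-zA-Z0-9_./-]+"')),
    ("TOSP_IDENTIFIER", re.compile(r'"[^"]+"')),
]

def longest_match(text, rules):
    """Imitate flex: the longest match wins, earlier rules win ties."""
    best_name, best_lexeme = None, ""
    for name, pattern in rules:
        m = pattern.match(text)
        if m and len(m.group()) > len(best_lexeme):
            best_name, best_lexeme = name, m.group()
    return best_name, best_lexeme

text = '"picture-A")\ncamera("picture-C")'

# The broken identifier rule runs past the closing quote and the newline.
print(longest_match(text, BROKEN_RULES))
# With the fix both rules match "picture-A"; the tie goes to the first rule.
print(longest_match(text, FIXED_RULES))
# ".pnr" contains an n, which stops [^{nl}]+, so the filename rule wins.
print(longest_match('"picture-A.pnr")\ncamera("picture-C")', BROKEN_RULES))
```

This also accounts for the ".par" versus ".pnr" observation: the broken rule only fails when the string happens to contain an n, an l or a brace.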

Related

Invalid regular expression - Invalid property name in character class

I am using a Fastify server with a TypeScript file that calls a function which makes sure people can't send unwanted characters. Here is the function:
const SAFE_STRING_REPLACE_REGEXP = /[^\p{Latin}\p{Zs}\p{M}\p{Nd}\-\'\s]/gu;
function secure(text:string) {
return text.replace(SAFE_STRING_REPLACE_REGEXP, "").trim();
}
But when I try to launch my server, I get an error message:
"Invalid regular expression - Invalid property name in character class".
It used to work just fine with my previous regex :
const SAFE_STRING_REPLACE_REGEXP = /[^0-9a-zA-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð\-\s\']/g;
function secure(text:string) {
return text.replace(SAFE_STRING_REPLACE_REGEXP, "").trim();
}
But I have been told it wasn't optimized enough. I have also been told that split/join performs better than regex/replace, but I don't know if I can use it in my case.
You need to use
const SAFE_STRING_REPLACE_REGEXP = /[^\p{Script=Latin}\p{Zs}\p{M}\p{Nd}'\s-]/gu;
// or
const SAFE_STRING_REPLACE_REGEXP = /[^\p{sc=Latin}\p{Zs}\p{M}\p{Nd}'\s-]/gu;
You need to prefix script names with sc= or Script= in Unicode property escapes, so \p{Latin} should be written as \p{Script=Latin}. See the ECMAScript reference.
Also, when you use the u flag, you cannot escape non-special chars, so do not escape ' and better move the - char to the end of the character class.
You can use split&join, too:
const SAFE_STRING_REPLACE_REGEXP = /[^\p{Script=Latin}\p{Zs}\p{M}\p{Nd}'\s-]/u;
console.log("Ącki-Łał русский!!!中国".split(SAFE_STRING_REPLACE_REGEXP).join(""))
Note that you don't need the g modifier with split; splitting on every match is the default behavior.
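For comparison, the same filter-and-join idea can be sketched in Python. This is only an approximation and not a port of the regex: Python's standard re module has no \p{...} support, so the hypothetical is_allowed helper stands in for \p{Script=Latin} by checking whether the Unicode character name starts with "LATIN ", which misses a few Latin-script characters whose names are different.

```python
import unicodedata

def is_allowed(ch):
    """Rough stand-in for [\\p{Script=Latin}\\p{Zs}\\p{M}\\p{Nd}'\\s-]."""
    if ch in "'-" or ch.isspace():
        return True
    category = unicodedata.category(ch)
    # Zs = space separators, Nd = decimal digits, M* = combining marks
    if category in ("Zs", "Nd") or category.startswith("M"):
        return True
    # Assumption: Latin-script letters have names beginning with "LATIN ".
    return unicodedata.name(ch, "").startswith("LATIN ")

def secure(text):
    """Drop every character that is not allowed, then trim."""
    return "".join(ch for ch in text if is_allowed(ch)).strip()

print(secure("Ącki-Łał русский!!!中国"))  # Ącki-Łał
```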

What happens with extra spaces and newlines in C/C++ code?

Is there a difference between:
int main(){
return 0;
}
and
int main(){return 0;}
and
int main(){
return
0;
}
They will all likely compile to the same executable. How does the C/C++ compiler treat the extra spaces and newlines, and is there any difference between how newlines and spaces are treated in C code?
Also, how about tabs? What's the significance of using tabs instead of spaces in code, if there is any?
Any sequence of one or more whitespace characters (space, line break, tab, ...) is equivalent to a single space.
Exceptions:
Whitespace is preserved in string literals. They can't contain line-breaks, except C++ raw literals (R"(...)"). The same applies to file names in #include.
Single-line comments (//) are terminated with line-breaks only.
Preprocessor directives (starting with #) are terminated with line-breaks only.
\ followed by a line-break removes both, allowing multi-line // comments, preprocessor directives, and string literals.
Also, whitespace symbols are ignored if there is punctuation (anything except letters, numbers, and _) to the left and/or to the right of it. E.g. 1 + 2 and 1+2 are the same, but return a; and returna; are not.
Exceptions:
Whitespace is not ignored inside string literals, obviously. Nor in #include file names.
Operators consisting of >1 punctuation symbols can't be separated, e.g. cout < < 1 is illegal. The same applies to things like // and /* */.
A space between punctuation might be necessary to prevent it from coalescing into a single operator. Examples:
+ +a is different from ++a.
a+++b is equivalent to a++ +b, but not to a+ ++b.
Pre-C++11, closing two template argument lists in a row required a space: std::vector<std::vector<int> >.
When defining a function-like macro, the space is not allowed before the opening parenthesis (adding it turns it into an object-like macro). E.g. #define A() replaces A() with nothing, but #define A () replaces A with ().
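The punctuation rules above follow from the lexer always taking the longest operator it can ("maximal munch"). The toy tokenizer below is a sketch of that idea only. It is not a C lexer: it knows a handful of operators, and tokenize is a made-up helper name.

```python
import re

# Longer operators are listed first so the alternation prefers them,
# which imitates maximal munch for this small operator set.
TOKEN = re.compile(r"\s*(\+\+|--|<<|>>|[-+*/<>=;()]|\w+)")

def tokenize(code):
    """Split code into tokens, skipping whitespace between them."""
    tokens, pos = [], 0
    code = code.rstrip()
    while pos < len(code):
        m = TOKEN.match(code, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {code[pos]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(tokenize("a+++b"))       # ['a', '++', '+', 'b'], same as "a++ +b"
print(tokenize("a+ ++b"))      # ['a', '+', '++', 'b']
print(tokenize("1 + 2") == tokenize("1+2"))  # True
print(tokenize("return a;"))   # ['return', 'a', ';']
print(tokenize("returna;"))    # ['returna', ';']
print(tokenize("cout < < 1"))  # ['cout', '<', '<', '1']
```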

Comment pattern match in flex using states

I am trying to match single line comment pattern in flex. Patterns of the comment could be:
//this is a single /(some random stuff) line comment
Or it could be like this:
// this is also a comment\
continuation of the comment from previous line
From the example it's obvious that I have to handle the multi-line case too.
Now my approach was using states. This is what I have so far:
"//" {
yymore();
BEGIN (SINGLE_COMMENT);
}
<SINGLE_COMMENT>([^{NEWLINE}]|\\[(.){NEWLINE}]) {
yymore();
}
<SINGLE_COMMENT>([^{NEWLINE}]|[^\\]{NEWLINE}) {
logout << "Line no " << line_count << ": TOKEN <COMMENT> Lexeme " << string(yytext) << "\nfound\n\n";
BEGIN (INITIAL);
}
NEWLINE is declared as:
NEWLINE \r?\n
My declaration unit:
%option noyywrap
%x SINGLE_COMMENT
int line_count = 1;
const int bucketSize = 10; // change if necessary
ofstream logout;
ofstream tokenout;
SymbolTable symbolTable(bucketSize);
Action of NEWLINE:
{NEWLINE} {
line_count++;
}
If I run it with the following input:
// hello\
int main
This is my log file:
Line no 1: TOKEN <COMMENT> Lexeme // hello\
found
Line no 1: TOKEN <INT> Lexeme int found
Line no 1: TOKEN <ID> Lexeme main found
ScopeTable # 1
6 --> < main , ID >
So, it's not catching the multi-line comment. Also, line_count is not incremented; it stays the same. Can anybody help me figure out what I have done wrong?
In (f)lex, as in most regular expression engines, [ and ] enclose a character class description. A character class is a set of individual characters, and it always matches exactly one character which is a member of that set. There are also negated character classes which are written the same way except that they start with [^ and match exactly one character which is not a member of the set.
Character classes are not the same as sequences of characters:
ab matches an a followed by a b
[ab] matches either an a or a b
Since character classes are just sets of characters, it is meaningless for the individual characters in the class to be repeated or optional, etc. Consequently, almost no regular expression operators (*, +, ?, etc.) are meaningful inside a character class. If you put one of them in a character class expression, it is handled just like an ordinary character:
a* matches 0 or more as
[a*] matches either an a or a *
One of the features flex provides which is not provided by most other regular expression systems is macro expansions, of the form {name}. Here the { and } indicate the expansion of a defined macro, whose name is contained between the braces. These characters are also not special inside a character class:
{identifier} matches whatever the expanded macro named identifier would match.
[{identifier}] matches a single character which is {, } or one of the letters definrt
Macro definitions seem to be overused by beginners. My advice is always to avoid them, and thereby avoid the confusion which they create.
It's also worth noting that (f)lex does not have an operator which negates a subpattern. Only character classes can be negated; there is no easy way to write "match anything other than foo". However, you can generally rely on the first longest-match rule to effectively implement negations: if some pattern p executes, then there cannot be any pattern which would match more than p. Thus, it might not be necessary to explicitly write the negation.
For example, in your comment detector where the only real issue is dealing with carriage return (\r) characters which are not followed by newline characters, you could use (f)lex's pattern matching algorithm to your advantage:
<SINGLE_COMMENT>{
[^\\\r\n]+ ;
\\\r?\n { ++line_count; }
\\. ; /* only matches if the above rule doesn't */
\r?\n { ++line_count; BEGIN(INITIAL); }
\r ; /* only matches if the above rule doesn't */
}
By the way, it's usually much easier to provide %option yylineno than to try to track newlines manually.
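To follow what the five rules in the <SINGLE_COMMENT> block do to the input from the question, here is a rough Python simulation. It is a sketch and not flex: the alternation tries the rules in order, which happens to agree with flex's longest-match choice for this particular rule set. scan_comment is a hypothetical helper, and unlike the flex rules it also collects the comment text.

```python
import re

# One alternative per rule of the <SINGLE_COMMENT> block, in the same order.
COMMENT_RULES = re.compile(
    r"(?P<text>[^\\\r\n]+)"   # ordinary comment characters
    r"|(?P<cont>\\\r?\n)"     # backslash-newline: the comment continues
    r"|(?P<escape>\\.)"       # backslash followed by anything else
    r"|(?P<end>\r?\n)"        # real end of line: the comment ends
    r"|(?P<cr>\r)"            # lone carriage return
)

def scan_comment(source, start, line_count=1):
    """Scan a // comment beginning at source[start].

    Returns (lexeme, next_position, updated_line_count).
    """
    if not source.startswith("//", start):
        raise ValueError("not at the start of a // comment")
    pos = start + 2
    while pos < len(source):
        m = COMMENT_RULES.match(source, pos)
        if m is None:  # only a backslash as the very last character
            pos += 1
            continue
        pos = m.end()
        if m.lastgroup in ("cont", "end"):
            line_count += 1
        if m.lastgroup == "end":
            return source[start:m.start()], pos, line_count
    return source[start:pos], pos, line_count

source = "// hello\\\nint main\nint x"
lexeme, next_pos, lines = scan_comment(source, 0)
print(repr(lexeme))            # '// hello\\\nint main'
print(repr(source[next_pos:])) # 'int x'
print(lines)                   # 3
```

With the continuation rule in place, "int main" becomes part of the comment and the line counter advances twice, once per physical line.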

How to Modify all beginnings and endings of a function

I would like to modify all the functions which are of the following kind:
returnType functionName(parameters){
OLD_LOG; // Always the first line of the function
//stuff to do
return result; // may not be here in case of function returning void
} // The closing } is not always at the start of the line, but it is always the first non-whitespace character on its line and is indented by the same amount as 'returnType'
by
returnType functionName(parameters){
NEW_LOG("functionName"); // the above function name
//stuff to do
END_LOG();
return result; //if any return (if possible, END_LOG() should appear just before any return, or at the end of the function if there is no return)
}
There are at least a hundred of those functions.
Therefore I would like to know if it is possible to do that using a "find/replace" in a text editor supporting regex, for example, or anything else.
Thank you
Here is an attempt at the same:
Regex
/(?<=\s)(\w+)(?=\()(.*\{\n.*)(OLD_LOG;)(.*)(\n\})/s
Test String
returnType functionName(parameters){
OLD_LOG;
//stuff to do
}
Replace string
\1 \2NEW_LOG("\1");\n\4\n END_LOG();\5
Result
returnType functionName (parameters){
NEW_LOG("functionName");
//stuff to do
END_LOG();
}
I have updated the regex to include optional return statement & optional spaces
Regex
/(?<=\s)(\w+)(?=\()(.*\{\n.*)(OLD_LOG;)(.*?)(?=(?:\s*)return|(?:\n\s*\}))/s
Replace string
\1 \2NEW_LOG("\1");\n\4\n END_LOG();
see if this works for you
Find
(\n([^\S\n]*)[^\s].*\s([^\s\(]+)\s*\(.*\)\s*\{\s*\n)(\s*)OLD\_LOG;((.*\s*\n)*?)(\s*return\s.*\r\n)?\2\}
Replace with
\1\4NEW\_LOG\(\"\3\"\);\5\4END_LOG\(\);\r\n\7\2\}
Notice that \n and \r\n are used. If your code file uses a different newline format, you need to modify accordingly.
Limitations of this replace are these assumptions:
1) OLD_LOG; is just one line below the function name.
2) Function has return type (any non space character before the function name is okay).
3) Function name and { are at the same line.
4) The closing } is indented by the same amount of whitespace as 'returnType', and there is no other such } inside the function.
5) Last return is just one line above the ending }.
It may be faster to use an editor with multiple carets support (e.g. Sublime Text, IntelliJ):
https://stackoverflow.com/a/18929134/802365
(Video) Multi-caret editing in Sublime Text

Grammar to Lex/Yacc

I have been tasked with a project that involves taking a grammar (in BNF form) and creating a lexical scanner (using lex) and a parser (using bison). I've never worked with either of these programs, and I think a good reference would be to see how these items are created from a grammar. I am looking for a grammar and its associated .l and .ypp files, preferably in C++. I've been able to find sample files or sample grammars, but not both together. I've spent some time searching and I could not find anything. I figured I'd post here in hopes that someone has something for me, but I will continue searching in the meantime.
I am currently reading Tom Niemann's
http://epaperpress.com/lexandyacc/download/LexAndYaccTutorial.pdf which seems to be pretty well written and understandable.
Thanks
Edit: I am still searching, I am starting to think that what I am looking for does not exist. Google usually never fails me!
Edit 2: Maybe if I provide some of the grammar, you folks could show me what the appropriate .l and .ypp files would look like. This is just a snippet of the grammar, I just need a little 'taste' of how this works and I think I can take it from there.
Grammar:
Program ::= Compound
Statements ::= Compound | Assignment | ...
Assignment ::= Var ASSIGN Expression
Expression ::= Var | Operator Expression Expression | Number
Compound ::= START Statements END
Number ::= NUMBER
Descriptions:
Assignment is the equal sign ":="
Var is an identifier that begins with a lower case letter and is followed by lower case letters or digits
START is the "start" keyword
END is the "end" keyword
Operator is "+", "-", "*", "/"
Number is decimal digits which could potentially be negative (minus sign in front)
Most of this is fairly straightforward. One part, however, is decidedly problematic. You've defined a number to (potentially) include a leading -, and that's a problem.
The problem is pretty simple. Given an input like 321-123, it's essentially impossible for the lexer (which won't normally keep track of the parser's current state) to guess whether that's supposed to be two tokens (321 and -123) or three (321, -, and 123). In this case, the - is almost certainly intended to be separate from the 123, but if the input were 321 + -123 you'd apparently want -123 as a single token instead.
To deal with that, you probably want to change your grammar so the leading - isn't part of the number. Instead, you always want to treat the - as an operator, and the number itself is composed solely of the digits. Then it's up to the parser to sort out expressions where the - is unary vs. binary.
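The ambiguity is easy to reproduce outside of lex. The sketch below uses two hypothetical token patterns in Python, one where the minus sign belongs to the number and one where it is always a separate token, and prints the token streams each produces.

```python
import re

# Variant A: an optional leading minus is part of the number token.
SIGNED = re.compile(r"-?[0-9]+|[-+*/]")
# Variant B: numbers are digits only; "-" is always its own token.
UNSIGNED = re.compile(r"[0-9]+|[-+*/]")

def tokens(pattern, text):
    """Return the tokens found in text; whitespace is simply skipped."""
    return pattern.findall(text)

print(tokens(SIGNED, "321-123"))       # ['321', '-123']  (subtraction lost)
print(tokens(UNSIGNED, "321-123"))     # ['321', '-', '123']
print(tokens(SIGNED, "321 + -123"))    # ['321', '+', '-123']
print(tokens(UNSIGNED, "321 + -123"))  # ['321', '+', '-', '123']
```

With variant B the token stream is unambiguous in both cases, and deciding whether a given - is unary or binary is left to the parser.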
Taking that into account, the lexer file would look something like this:
%{
#include "y.tab.h"
%}
%option noyywrap case-insensitive
%%
:= { return ASSIGN; }
start { return START; }
end { return END; }
[+/*] { return OPERATOR; }
- { return MINUS; }
[0-9]+ { return NUMBER; }
[a-z][a-z0-9]* { return VAR; }
[ \t\r\n] { ; }
%%
void yyerror(char const *s) { fputs(s, stderr); }
The matching yacc file would look something like this:
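%{
#include <stdio.h>
int yylex(void);               /* provided by the flex-generated scanner */
void yyerror(char const *s);   /* defined in the lexer file */
%}
%token ASSIGN START END OPERATOR MINUS NUMBER VAR
%left MINUS OPERATOR  /* the lexer returns these tokens, never '-' '+' '*' '/' */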
%%
program : compound
statement : compound
| assignment
;
assignment : VAR ASSIGN expression
;
statements :
| statements statement
;
expression : VAR
| expression OPERATOR expression
| expression MINUS expression
| value
;
value: NUMBER
| MINUS NUMBER
;
compound : START statements END
%%
int main() {
yyparse();
return 0;
}
Note: I've tested these only minimally, enough to verify that input I believe is grammatical, such as: start a:=1 b:=2 end and start a:=1+3*3 b:=a+4 c:=b*3 end, is accepted (no error message printed out), and that input I believe is ungrammatical, such as: 9:=13 and a=13, prints a syntax error message. Since this doesn't attempt to do anything more with the expressions than recognize which are grammatical, that's about the best we can do.