Regular expression to count ALL newline characters in C++

I am trying to write a rules.l file to generate a flex scanner that reads through any given input and prints out every token it recognizes (for example: string, int, +, -, if, else), along with its length, token code, and what line it is on. Everything works as it should, except that it is not counting newline characters within a string literal.
I have googled my heart out and read all kinds of things, and they all say that just using the expression \n should allow me to count every newline in the text.
I also use [ \t] to eat whitespace.
My output should say:
< line: 14, lexeme: |"last"|
but instead it says:
> line: 10, lexeme: |"last"|
Any input/advice would be greatly appreciated!
Here is a bit of my .l file for context:
%option noyywrap
%{
int line_number = 1;
%}
%%
if { return TOK_IF; }
else { return TOK_ELSE; }
.
.
.
[a-zA-Z]([a-zA-Z]|[0-9]|"_")* { return TOK_IDENTIFIER; }
\"(\\.|[^"\\])*\" { return TOK_STRINGLIT; }
[ \t]+ ;
[\n] {++line_number;}

I'd suggest you add
%option yylineno
to your Flex file, and then use the yylineno variable instead of trying to count newlines yourself. Flex gets the value right, and usually manages to optimise the computation.
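A minimal sketch of how the relevant pieces might fit together with that option (TOK_STRINGLIT as in the question; with yylineno enabled, the manual line_number counter and the [\n] rule can be dropped entirely):

```lex
%option noyywrap
%option yylineno
%%
\"(\\(.|\n)|[^\\"\n])*\"   { return TOK_STRINGLIT; }
[ \t\n]+                   ;  /* flex still updates yylineno for the skipped newlines */
```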
That said, \"([^"])*\" is not a good way to match string literals, because it terminates at the first quote mark, even one preceded by a backslash. That will fail disastrously if the string literal is "\"Bother,\" he said. \"It's too short.\""
Here's a better one:
\"(\\(.|\n)|[^\\"\n])*\"
(That will not match string literals which include unescaped newline characters; in C++, those are not legal. But you'll need to add another rule to match the erroneous string and produce an appropriate error message.)
I suppose it is possible that you must conform to the artificial requirements of a course designed by someone unaware of the yylineno feature. In that case, the simple solution of adding line_number = yylineno; at the beginning of every rule would probably be considered cheating.
What you will need to do is what flex itself does (but it doesn't make mistakes, and we programmers do): figure out which rules might match text including one or more newlines, and insert code in those specific rules to count the newlines matched. Typically, the rules in question are multi-line comments and string literals themselves (since the string literal might include a backslash line continuation.)
One way to figure out which rules might be matching newlines is to turn on the yylineno feature, and then examine the code generated by flex. Search for YY_RULE_SETUP in that file; the handler for every scanner rule (including the ones whose action does nothing) starts with that macro invocation. If you have enabled %option yylineno, flex figures out which rules might match a newline, and inserts code before YY_RULE_SETUP to fix yylineno. These rules start with the comment /* rule N can match eol */, where N is the index of the rule. You'll need to count rules in your source file to match N with the line number. Or you can look for the #line directives in the generated code.

Related

The Regex Rule of Flex

I am confused about the rule of flex lexer
My lexer can recognize the decimal and hex, but when I want to make a union for both of them as the Integer.
flex tells me it's test.l:13: unrecognized rule
here's my lexer file:
test.l
%{
#include <stdio.h>
#include <string.h>
int yylval;
%}
digit [0-9]
decimal ^({digit}|[1-9]{digit}+)$
hex 0[xX][0-9a-fA-F]+
integer {hex}|{decimal}
%%
{integer} {printf("integer - %s \n", yytext);}
%%
// run function
int yywrap(void) {
return 1;
}
int main(void) {
yylex();
return 0;
}
Why do you think you need to anchor your decimal pattern? The way it is written, it will only match a number which is by itself on a line, without even any white space.
Anyway, it's the anchor which creates a problem. In (f)lex, ^ can only appear at the beginning of a pattern, and the macro expansion of {hex}|{decimal} has the ^ in the middle.
Changing it to {decimal}|{hex} won't help because flex normally surrounds macro expansions with parentheses to avoid incorrect operator grouping. (The parentheses are not inserted if the macro ends with $, but the immediate replacement body of {integer} doesn't end with a $.)
That effectively makes it impossible to use ^ anchors in macros, and hard to use $. You probably don't need these anchors at all, so the easiest solution is likely to just get rid of them. But if you do need to anchor your patterns, you must do so in the rule itself, outside any macro.
You might also consider not relying on flex macros. Like C macros, they are not nearly as useful as they might at first appear. If you want meaningful names for character ranges, you'll find that flex already provides them: [[:digit:]] is [0-9]; [[:xdigit:]] is [0-9a-fA-F], and so on (the same categories as provided by C's <ctype.h> header).

Regular expression set max length for string literal

I am trying to figure out how to set the max length in a regular expression. My goal is to set my regular expression for string literals to a max length of 80.
Here is my expression if you need it:
["]([^"\\]|\\(.|\n))*["]|[']([^'\\]|\\(.|\n))*[']
I've tried adding {0,80} at both the front and the end of the expression, but either all the strings break down into smaller identifiers or none do so far.
Thanks in advance for the help!
Edit:
Here's a better explanation of what I am trying to accomplish.
Given "This string is over 80 characters long", when run through flex instead of being listed like:
line: 1, lexeme: |THIS STRING IS OVER 80 CHARACTERS LONG|, length: 81, token 4003
I would need it to be broken up like this:
line: 1, lexeme: |THIS|, length: 1, token 6000
line: 1, lexeme: |STRING|, length: 1, token 6000
line: 1, lexeme: |IS|, length: 1, token 6000
line: 1, lexeme: |OVER|, length: 1, token 6000
line: 1, lexeme: |80|, length: 1, token 6000
line: 1, lexeme: |CHARACTERS|, length: 1, token 6000
line: 1, lexeme: |LONG|, length: 1, token 6000
While string "THIS STRING IS NOT OVER 80 CHARACTERS LONG" would be shown as:
line: 1, lexeme: |THIS STRING IS NOT OVER 80 CHARACTERS LONG|, length: 50, token: 4003
I've tried adding {0,80} at both the front and the end of the expression
The brace operator is not a length limit; it's a range of repetition counts. It has to go where a repetition operator (*, + or ?) would go: immediately after the subpattern being repeated.
So in your case you might use: (I left out the ' alternative for clarity.)
["]([^"\\\n]|\\(.|\n)){0,80}["]
Normally, I would advise you not to do this, or at least to do it with some care. (F)lex regular expressions are compiled into state transition tables, and the only way to compile a maximum repetition count is to copy the subpattern's states once for each repetition. So the above pattern needs to make 80 copies of the state transitions for ([^"\\\n]|\\(.|\n)). (With a simple subpattern like that, the state blow-up might not be too serious. But with more complex subpatterns, you can end up with enormous transition tables.)
Edit: Split long strings into tokens as though they weren't quoted.
An edit to the question suggests that what is expected is to treat strings of length greater than 80 characters as though the quote marks had never been entered; that is, report them as individual word and number tokens without any intervening whitespace. That seems so idiosyncratic to me that I can't convince myself that I'm reading the requirement correctly, but in case I am, here's the outline of a possible approach.
Let's suppose that the intention is that short strings should be reported as single tokens, while long strings should be reinterpreted as a series of tokens (possibly but not necessarily the same tokens as would be produced by an unquoted input). If that's the case, there are really two lexical analyses which need to be specified, and they will not use the same pattern rules. (For one thing, the rescan needs to recognise a quotation mark as the end of a literal, causing the scanner to revert to normal processing, while the original scan considers a quotation mark to start a string literal.)
One possibility would be to just collect the entire long string and then use a different lexical scanner to break it into whatever parts seem useful, but that would require some complicated plumbing in order to record the resulting token stream and return it one token at a time to the yylex caller. (This would be reasonably easy if yylex were pushing tokens to a parser, but that's a whole other scenario.) So I'm going to discard this option other than to mention that it is possible.
So the apparently simpler option is to ensure that the original scan halts on the 81st character, so that it can change the lexical rules and back up the scan to apply the new rules.
(F)lex provides start conditions as a way of providing different lexical contexts. By using the BEGIN macro in a (f)lex action, it is possible to dynamically switch between start conditions, switching the scanner into a different context. (They're called "start conditions" because they change the scanner's state at the start of the token.)
Each start condition (except the default start condition, which is called INITIAL) needs to be declared in the flex prologue. In this case, we'll only need one extra start condition, which we'll call SC_LONG_STRING. (By convention, start condition names are written in ALL CAPS since they are translated into either C macros or enumeration values.)
Flex has two possible mechanisms for backing up a scan, either of which will work here. I'm going to show the explicit back-up because it's safer and more flexible; the alternative is to use the trailing context operator (/) which would work perfectly well in this solution, but not in other very similar contexts.
So we start by declaring our start condition and then the rules for handling quoted strings in the default (INITIAL) lexical context:
%x SC_LONG_STRING
%%
I'm only showing the double-quote rules since the single-quote rules are virtually identical. (Single-quote will require another start condition, because the termination pattern is different.)
The first rule matches strings where there are at most 80 characters or escape sequences in the literal, using a repetition operator as described above.
["]([^"\\\n]|\\(.|\n)){0,80}["] { yylval.str = strndup(yytext + 1, yyleng - 2);
return TOKEN_STRING;
}
The second rule matches exactly one additional non-quote character. It does not attempt to find the end of the string; that will be handled within the SC_LONG_STRING rules. The rule does two things:
Switch to a different start condition.
Tell the scanner to back up the scan, using the yyless(n) special action, which truncates the current token at n characters and causes the next token scan to restart at that point. So yyless(1) leaves only the " in the current token. Since we don't return at this point, the current token is immediately discarded.
["]([^"\\\n]|\\(.|\n)){81} { BEGIN(SC_LONG_STRING); yyless(1); }
The final rule is a fallback for unterminated strings; it will trigger if something that looks like a string was started but neither of the above rules matched it. That can only happen if a newline or end-of-file indicator is encountered before the closing quote:
["]([^"\\\n]|\\(.|\n)){0,80} { yyerror("Unterminated string"); }
Now, we need to specify the rules for SC_LONG_STRING. For simplicity, I'm going to assume that it is only desired to split the string into whitespace separated units; if you want to do a different analysis, you can change the patterns here.
Start conditions are identified by writing the name of the start condition inside angle brackets. The start condition name is considered part of the pattern, so it should not be followed by a space (space characters aren't allowed in lex patterns unless they are quoted characters). Flex is more flexible; read the cited manual section for more details.
The first rule simply returns to the INITIAL start condition when a double quote terminates the string. The second rule discards whitespace inside the long string, and the third rule passes the whitespace-separated components on to the caller. Finally, we need to consider the possible error of an unterminated long string, which will result in encountering a newline or end-of-file indicator.
<SC_LONG_STRING>["] { BEGIN(INITIAL); }
<SC_LONG_STRING>[ \t]+ ;
<SC_LONG_STRING>([^"\\ \n\t]|\\(.|\n))+ { yylval.str = strdup(yytext);
return TOKEN_IDENTIFIER;
}
<SC_LONG_STRING>\n |
<SC_LONG_STRING><<EOF>> { yyerror("Unterminated string"); }
Original answer: Produce a meaningful error for long strings
You need to specify what you are planning to do if the user enters a string which is too long. If your scanner doesn't recognise it as a string, then it will return some kind of fall-back token which will probably induce a syntax error from the parser; that provides no useful feedback to the user, so they probably won't have a clue where the syntax error comes from. And you certainly cannot restart the lexical analysis in the middle of a string which happens to be too long: that would end up interpreting text which was supposed to be quoted as though it consisted of tokens.
A much better strategy is to recognise strings of any length, and then check the length in the action associated with the pattern. As a first approximation, you could try this:
["]([^"\\]|\\(.|\n)){0,80}["] { if (yyleng <= 82) return STRING;
else {
yyerror("String literal exceeds 80 characters");
return BAD_STRING;
}
}
(Note: (F)lex sets the variable yyleng to the length of yytext, so there is never a need to call strlen(yytext). strlen needs to scan its argument to compute the length, so it's quite a bit less efficient. Moreover, even in cases where you need to call strlen, you should not use it to check if a string exceeds a maximum length. Instead, use strnlen, which will limit the length of the scan.)
But that's just a first approximation, because it counts source characters, rather than the length of a string literal. So, for example, assuming you plan to allow hex escapes, the string literal "\x61" will be counted as though it has four characters, which could easily cause string literals containing escapes to be incorrectly rejected as too long.
That problem is ameliorated but not solved by the pattern with a limited repeat count, because the pattern itself does not fully parse escape sequences. In the pattern ["]([^"\\]|\\(.|\n)){0,80}["], the \x61 escape sequence will be counted as three repetitions (\x, 6, 1), which is still more than the single character it represents. As another example, splices (\ followed by a newline) will be counted as one repetition, whereas they don't contribute to the length of the literal at all since they are simply removed.
So if you want to correctly limit the length of a string literal, you will have to parse the source representation more precisely. (Or you would need to reparse it after identifying it, which seems wasteful.) This is usually done with a start condition.
In case you use the regex inside flex and need to monitor the match's length, the easiest thing would be to look at the matched text held in yytext (flex also sets yyleng to its length):
["]([^"\\]|\\(.|\n))*["]|[']([^'\\]|\\(.|\n))*['] { if (yyleng > 82) { ... } }
I used 82 to account for the two enclosing quote characters.

How to specify a specific string in Regex

I'm tinkering around with flex and bison to create a small calculator program. The input will be something like this:
read A
read B
sum := A + B
write sum
Read and write will be keywords indicating reading a value in or writing a value to the output. ":=" is the assignment operator. A and B are identifiers, which can be strings. There will also be line comments (//comment) and block comments (/* asdfsd */).
Would these regular expression be correct to specify the little grammar I specify?
[:][=] //assignment operator
[ \t] //skipping whitespace
[a-zA-Z0-9]+ //identifiers
[Rr][Ee][Aa][Dd] //read symbols, not case-sensitive
[/][/] //comment
For the assignment operator and the comment regex, can I just do this instead? would flex and bison accept it?
":=" //assignment operator
"//" //comment
Yes, ":=" and "//" will work, though the comment rule should really be "//".* because you want to skip everything after the // (until the end of line). If you just match "//", flex will try to tokenize what comes after it, which you don't want because a comment doesn't have to consist of valid tokens (and even if it did, those tokens should be seen by the parser).
Further [Rr][Ee][Aa][Dd] should be placed before the identifier rule. Otherwise it will never be matched (because if two rules can match the same lexeme, flex will pick the one that comes first in the file). It can also be written more succinctly as (?i:read) or you can enable case insensitivity globally with %option caseless and just write read.
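Putting both points together, the rules section might look like this (the TOK_* names are placeholders, not from the question):

```lex
%option caseless
%%
read                  { return TOK_READ; }   /* keyword rules before the identifier rule */
write                 { return TOK_WRITE; }
":="                  { return TOK_ASSIGN; }
"//".*                ;                      /* skip the whole comment line */
[a-zA-Z][a-zA-Z0-9]*  { return TOK_IDENT; }
[ \t\n]+              ;
```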
You can start with (with ignore case option):
(read|write)\s+[a-z]+ will match a read/write expression;
[a-z]+\s:=[a-z+\/* -]* will match an assignment with a simple calculation;
\/\/.* will match an inline comment;
\/\*[\s\S]*\*\/ will match a multi-line comment.
Keep in mind that these are basic regexes and may not fit more complex syntaxes.
You can try it with Regex101.com for example

How to create a regex without certain group of letters in lex

I've recently started learning lex , so I was practicing and decided to make a program which recognises a declaration of a normal variable. (Sort of)
This is my code :
%{
#include "stdio.h"
%}
dataType "int"|"float"|"char"|"String"
alphaNumeric [_\*a-zA-Z][0-9]*
space [ ]
variable {dataType}{space}{alphaNumeric}+
%option noyywrap
%%
{variable} printf("ok");
. printf("incorect");
%%
int main(){
yylex();
}
Some example inputs for which the output should be ok:
int var3
int _varR3
int _AA3_
And if I type int float as input, it returns ok, which is wrong because both are reserved words.
So my question is what should I modify to make my expression ignore the 'dataType' words after space?
Thank you.
A preliminary consideration: Typically, detection of the construction you point out is not done at the lexing phase, but at the parsing phase. On yacc/bison, for instance, you would have a rule that only matches a "type" token followed by an "identifier" token.
To achieve that with lex/flex though, you could consider playing around with the negation (^) and trailing context (/) operators. Or...
If you're running flex, perhaps simply surrounding all your regex with parenthesis and passing the -l flag would do the trick. Notice there are a few differences between lex and flex, as described in the Flex manual.
This is really not the way to solve this particular problem.
The usual way of doing it would be to write separate pattern rules to recognize keywords and variable names. (Plus a pattern rule to ignore whitespace.) That means that the tokenizer will return two tokens for the input int var3. Recognizing that the two tokens are a valid declaration is the responsibility of the parser, which will repeatedly call the tokenizer in order to parse the token stream.
However, if you really want to recognize two words as a single token, it is certainly possible. (F)lex does not allow negative lookaheads in regular expressions, but you can use the pattern matching precedence rule to capture erroneous tokens.
For example, you could do something like this:
dataType int|float|char|String
id [[:alpha:]_][[:alnum:]_]*
%%
{dataType}[[:white:]]+{dataType} { puts("Error: two types"); }
{dataType}[[:white:]]+{id} { puts("Valid declaration"); }
/* ... more rules ... */
The above uses Posix character classes instead of writing out the possible characters. See man isalpha for a list of Posix character classes; the character class component [:xxxxx:] contains exactly the characters accepted by the isxxxxx standard library function. I fixed the pattern so that it allows more than one space between the dataType and the id, and simplified the pattern for ids.

How do I scan for a "string" constant in a flex scanner?

As part of a class assignment to create a flex scanner, I need to create a rule that recognizes a string constant. That is, a collection of characters between a set of quotation marks. How do I identify a bad string?
The only way a string literal can be "bad" is if it is missing the closing quote mark. Unfortunately, that is not easy to detect, since it is likely that there is another string literal later in the program, and the opening quote of that following literal will be taken as the missing close quote. Once the quote marks are out of sync, the lexical scan will continue incorrectly until an end-of-file indicator is encountered inside a supposed string literal, at which point an error can be reported.
Languages in the C family do not allow string literals to contain newline characters, which allows missing quotes to be detected much earlier. In that case, a "bad" string literal is one which contains a newline. It's quite possible that the lexical scan will incorrectly include characters which were intended to be outside the string literal, but error recovery is somewhat easier than in languages where a missing quote effectively inverts the quoting of the rest of the program.
It's worth noting that it is almost as common to accidentally fail to escape a quote inside a quoted string, which will result in the string literal being closed prematurely; the intended close quote will then be lexed as an open quote, and the eventual lexical error will again be delayed.
(F)lex uses the "longest match" rule to decide which pattern to apply. If the string pattern doesn't allow newlines, as in C, it might be (in a simplified version, leaving out the complication of escapes) something like:
\"[^"\n]*\"
(remembering that in flex, a negated character class such as [^"] does match a newline unless newline is explicitly excluded). If the closing quote is not present on the line, this pattern will not match, and it is likely that the fallback pattern will succeed, matching only the open quote. That's good enough if immediate failure is acceptable, but if you want to do error recovery, you probably want to ignore the rest of the line. In that case, you might add a pattern such as
\"[^"]*
That will match every valid string as well, of course (not including the closing quote) but it doesn't matter because the valid string literal pattern's match will be longer (by one character). So the pattern without the closing quote will only match unterminated string literals.
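Put together, the two rules (still leaving out escape handling) would be:

```lex
\"[^"\n]*\"   { return TOK_STRING; }   /* longest match: wins for any terminated literal */
\"[^"\n]*     { yyerror("unterminated string literal"); }
```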