I am working on writing a flex scanner for a language that supports nested comments like this:
/*
/**/
*/
I used to work with OCaml/ocamllex, which supports recursively calling the lexer very elegantly. But I am now switching to C++/flex; how do I handle such nested comments?
Assuming that only comments can be nested in comments, a stack is a very expensive solution for what could be achieved with a simple counter. For example:
%x SC_COMMENT
%%

    int comment_nesting = 0;  /* Line 4 */

"/*"           { BEGIN(SC_COMMENT); }
<SC_COMMENT>{
  "/*"         { ++comment_nesting; }
  "*"+"/"      { if (comment_nesting) --comment_nesting;
                 else BEGIN(INITIAL); }
  "*"+         ; /* Line 11 */
  [^/*\n]+     ; /* Line 12 */
  [/]          ; /* Line 13 */
  \n           ; /* Line 14 */
}
Some explanations:
Line 4: Indented lines before the first rule are inserted at the top of the yylex function where they can be used to declare and initialize local variables. We use this to initialize the comment nesting depth to 0 on every call to yylex. The invariant which must be maintained is that comment_nesting is always 0 in the INITIAL state.
Lines 11-13: A simpler solution would have been the single pattern (.|\n), but that would result in every comment character being treated as a separate subtoken. Even though the corresponding action does nothing, each match would break out of the scan loop and execute the action switch statement, once per character. So it is usually better to try to match several characters at once.
We need to be careful about / and * characters, though; we can only ignore those asterisks which we are certain are not part of the */ which terminates the (possibly nested) comment. Hence lines 11 and 12. (Line 12 won't match a sequence of asterisks which is followed by a / because those will already have been matched by the pattern above, at line 9.) And we need to ignore / if it is not followed by a *. Hence line 13.
Line 14: However, it can also be sub-optimal to match too large a token.
First, flex is not optimized for large tokens, and comments can be very large. If flex needs to refill its buffer in the middle of a token, it will retain the open token in the new buffer, and then rescan from the beginning of the token.
Second, flex scanners can automatically track the current line number, and they do so relatively efficiently. The scanner checks for newlines only in tokens matched by patterns which could possibly match a newline. But the entire match needs to be scanned.
We reduce the impact of both of these issues by matching newline characters inside comments as individual tokens. (Line 14, also see line 12) This limits the yylineno scan to a single character, and it also limits the expected length of internal comment tokens. The comment itself might be very large, but each line is likely to be limited to a reasonable length, thus avoiding the potentially quadratic rescan on buffer refill.
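To try the scanner out, here is a minimal driver (my sketch, not part of the original answer): appended after a second %%, it assumes the rules above are saved in comments.l and simply strips comments, since the rules discard comment text and flex's default rule echoes everything else.

%%
/* User-code section: a trivial yywrap so we don't need -lfl,
   and a main that runs the scanner over standard input. */
int yywrap(void) { return 1; }

int main(void) {
    yylex();
    return 0;
}

Build with flex -o comments.c comments.l && cc -o comments comments.c, then feed it source text on standard input.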
I resolved this problem by using yy_push_state, yy_pop_state and a start condition, like this (note that flex only generates these functions when %option stack is specified):
%option stack
%x comment
%%
"/*"       { yy_push_state(comment); }
<comment>{
  "*/"     { yy_pop_state(); }
  "/*"     { yy_push_state(comment); }
  .|\n     ; /* discard comment contents */
}
%%
In this way, I can handle any level of nested comments.
I'm trying to make a multiline comment with these conditions:
Starts with ''' and finishes with '''
Can't contain exactly three ''' inside; for example:
'''''' Correct
'''''''' Correct
'''a'a''a''''a''' Correct
''''''''' Incorrect
'''a'''a''' Incorrect
This is my approximation, but I'm not able to make the correct expression for this:
'''([^']|'[^']|''[^']|''''+[^'])*'''+
The easy solution is to use a start condition. (Note that this doesn't pass on all your test cases, because I think the problem description is ambiguous. See below.)
In the following, I assume that you want to return the matched token, and that you are using a yacc/bison-generated parser which includes char* str as one of the union types. The start-condition block is a flex extension; in the unlikely event that you're using some other lex derivative, you'll need to write out the patterns one per line, each with the <SC_TRIPLE_QUOTE> prefix (and no space between it and the pattern).
%{
#include <string.h>   /* for strndup */
%}
%x SC_TRIPLE_QUOTE
%%
'''          { BEGIN(SC_TRIPLE_QUOTE); }
<SC_TRIPLE_QUOTE>{
  '''        { yylval.str = strndup(yytext, yyleng - 3);
               BEGIN(INITIAL);
               return STRING_LITERAL;
             }
  [^']+      |
  ''?/[^']   |
  ''''+      { yymore(); }
  <<EOF>>    { yyerror("Unterminated triple-quoted string");
               return 0;
             }
}
I hope that's more or less self-explanatory. The four patterns inside the start condition match the ''' terminator, any sequence of characters other than ', no more than two ', and at least four '. yymore() causes the respective matches to be accumulated. The call to strndup excludes the delimiter.
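A worked example may help (my input, not from the original question). Given '''ab''cd''', the opening ''' switches into the start condition. Then [^']+ matches ab, ''?/[^'] matches '' (two quotes followed by c), and [^']+ matches cd, each calling yymore() so the text accumulates. Finally ''' matches as the terminator; at that point yytext is ab''cd''' and yyleng is 9, so strndup(yytext, yyleng - 3) returns the internal value ab''cd.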
Note:
The above code won't provide what you expect for the second example, because I don't think it is possible (or, alternatively, you need to be clearer about which of the two possible analyses is correct, and why). Consider the possible comments:
'''a'''
'''a''''
'''a''''a'''
According to your description (and your third example), the third one should match, with the internal value a''''a, because '''' is more than three quotes. But according to your second example (slightly modified), the second one should match, with the internal value a', because the final ''' is taken as the terminator. The question is: how are these two possible interpretations supposed to be distinguished? In other words, what clue does the lexical scanner have that in the second case the token ends at ''', while in the third it doesn't? Since both of these are part of an input stream, there could be arbitrary text following. And since these are supposed to be multi-line comments, there's no a priori reason to believe that the newline character isn't part of the token.
So I made an arbitrary choice about which interpretation to choose. I could have made the other arbitrary choice, but then a different example wouldn't work.
I am trying to write a lex program which will remove both single-line and multi-line comments.
%{
#include<stdio.h>
int single=0;
int multi=0;
%}
%%
"//"([a-z]|[A-Z]|[0-9]|" ")* {++single;}
"/*"(.*\n)* "*/" {++multi;}
%%
int main(int argc, char **argv)
{
yyin=fopen("abc.txt","r");
yylex();
printf("no of single line comment = %d ", single);
printf("no of multi line comment = %d ", multi);
return 0;
}
This program is not able to remove multi-line comment.
If there are multiple multi-line comments in your abc.txt file, then your pattern for multi-line comments will match everything between the start of the first multi-line comment and the end of the last one. This happens because lex is greedy and always matches the longest possible prefix of the input, and your pattern for multi-line comments allows /* and */ themselves to be matched by (.*\n)*.
Also, your code will not detect single-line comments that contain any characters other than alphanumeric characters and spaces (e.g. -, ,, ;, :, etc.).
Change your pattern actions to these and it should achieve your objective.
"//".*\n { ++single; }
"/*"[^*/]*"*/" { ++multi; }
Note that the above code will still leave some newlines in place of the removed multi-line comments. Removing those is a bit tricky, and I have not been able to find a quick solution for it.
Hope this helps!
For flex,
"//".* {singleLine++;}
"/*"([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+\/ {multiLine++;}
For detailed information: https://blog.ostermiller.org/finding-comments-in-source-code-using-regular-expressions/
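For reference, here is how these rules might be assembled into a complete, compilable scanner (my sketch, not from either answer; it reuses the abc.txt filename from the question, and prints the counts to stderr so that stdout carries only the stripped text):

%{
#include <stdio.h>
int single = 0;
int multi = 0;
%}
%option noyywrap
%%
"//".*                                       { ++single; }
"/*"([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+\/  { ++multi; }
.|\n                                         { ECHO; /* copy non-comment text through */ }
%%
int main(void)
{
    yyin = fopen("abc.txt", "r");
    if (!yyin) { perror("abc.txt"); return 1; }
    yylex();
    fprintf(stderr, "no of single line comment = %d\n", single);
    fprintf(stderr, "no of multi line comment = %d\n", multi);
    return 0;
}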
I'm trying to write a lexer to parse a file like that looks this:
one.html /two/
one/two/ /three
three/four http://five.com
Each line has two strings separated by a space. I need to create two regex patterns: one to match the first string, and another to match the second string.
This is my attempt at the regex for the lexer (a file named lexer.l to be run by flex):
%%
(\S+)(?:\s+\S+) { printf("FIRST %s\n", yytext); }
(?:\S+\s+)(\S+) { printf("SECOND %s\n", yytext); }
. { printf("Mystery character %s\n", yytext); }
%%
I have tested both (\S+)(?:\s+\S+) and (?:\S+\s+)(\S+) in the Regex101 tester and they both seem to be working properly: https://regex101.com/r/FQTO15/1
However, when I try to build the lexer by running flex lexer.l, I get an error:
lexer.l:3: warning, rule cannot be matched
This is referring to the second rule I have. If I attempt to reverse the order of the rules, I get the error on the second one yet again. If I only leave in one of the rules, it works perfectly fine.
I believe this issue has to do with the fact that both regexes are similar and of the same length, so flex sees it as ambiguous, even though the two regexes capture different things (but they match the same things?).
Is there anything I can do with the regex so that it will capture/match what I want without clashing with each other?
EDIT: More Test Examples
one.html /two/
one/two.html /three/four/
one /two
one/two/ /three
one_two/ /three
one%20two/ /three
one/two/ /three/four
one/two /three/four/five/
one/two.html http://three.four.com/
one/two/index.html http://three.example.com/four/
one http://two.example.com/three
one/two.pdf https://example.com
one/two?query=string /three/four/
go.example.com https://example.com
EDIT
It turns out that the regex engine used by flex is rather limited. It cannot do capture groups or lookaheads, and it doesn't support \s for whitespace.
So this wouldn't work:
^.*\s.*$
But this does:
^.*" ".*$
Thanks to @fossil for all their help.
Although there are ways to solve your problem as stated, I think you would be better off understanding the intended use of (f)lex, and to find a solution consistent with its processing model.
(F)lex is intended to split an input into individual tokens. Each token has a type, and it is expected that the type of a token can be determined simply by looking at it (and not at its context). The classic model of token types comes from computer programs, where we have, for example, identifiers, numbers, certain keywords, and various operators. Given an appropriate set of rules, a (f)lex scanner will take an input like
a = b*7 + 2;
and produce a stream of tokens:
identifier = identifier * number + number ;
Each of these tokens has an associated "semantic value" (which not all of them actually require), so that the two identifier tokens and the two number tokens are not just anonymous blobs.
Note that a and b in the above line have different roles. a is being assigned to, while b is being referred to. But that's not relevant to their form, and it is not evident from their form. They are just tokens. Figuring out what they mean and their relationship with each other is the role of a parser, which is a separate part of the parsing model. The intention of the two-phase scan/parse paradigm is to simplify both tasks by abstracting away complications: the scanner knows nothing about context or meaning, while the parser can deduce the logical structure of the input without concerning itself with the messy details of representation and irrelevant whitespace.
In many ways, your problem is a bit outside of this paradigm, in part because the two token types you have cannot be distinguished on the basis of their appearance alone. If they have no useful internal structure, though, then you could just accept that your input consists of
"paths", which do not contain whitespace, and
newline characters.
You could then use a combination of a lexer and a parser to break the input into lines:
File splitter.l
%{
#include <string.h>   /* for strdup */
#include "splitter.tab.h"
%}
%option noinput nounput noyywrap nodefault
%%
\n { return '\n'; }
[^[:space:]]+ { yylval = strdup(yytext); return PATH; }
[[:space:]] /* Ignore whitespace other than newlines */
File splitter.y
%code {
#include <stdio.h>
#include <stdlib.h>
int yylex();
void yyerror(const char* msg);
}
%code requires {
#define YYSTYPE char*
}
%token PATH
%%
lines: %empty
     | lines line '\n'
     ;
line : %empty
     | PATH PATH { printf("Map '%s' to '%s'\n", $1, $2);
                   free($1); free($2);
                 }
     ;
%%
void yyerror(const char* msg) {
fprintf(stderr, "%s\n", msg);
}
int main(int argc, char** argv) {
return yyparse();
}
Quite a lot of the above is boiler-plate; it's worth concentrating just on the grammar and the token patterns.
The grammar is very simple:
lines: %empty
     | lines line '\n'
     ;
line : %empty
     | PATH PATH { printf("Map '%s' to '%s'\n", $1, $2);
                   free($1); free($2);
                 }
     ;
The interesting line is the last one, which says that a line consists of two PATHs. That handles each line by printing it out, although you'd probably want to do something different. It is this line which understands that the first word on a line and the second word on the same line have different functions. Note that it doesn't need the lexer to label the two words as "FIRST" and "SECOND", since it can see that all by itself :)
The two calls to free release the memory allocated by strdup in the lexer, thus avoiding a memory leak. In a real application, you'd need to make sure you don't free the strings until you don't need them any more.
The lexer patterns are also very simple:
\n { return '\n'; }
[^[:space:]]+ { yylval = strdup(yytext); return PATH; }
[[:space:]] /* Ignore whitespace other than newlines */
The first one returns a special single-character token, a newline character, for the end-of-line token. The second one matches any string of non-whitespace characters. ((F)lex doesn't know about GNU regex extensions, so it doesn't have \s and friends. It does, however, have the much more readable Posix character classes, which are listed in the flex manual, among other places.) The third pattern skips any whitespace. Since \n was already handled by the first pattern, it cannot be matched here (which is why this pattern is a single whitespace character and not a repetition).
In the second pattern, we assign a value to yylval, which is the semantic value of the token. (We don't do this elsewhere because the newline token doesn't need a semantic value.) yylval always has type YYSTYPE, which we have arranged to be char* by a #define. Here, we just set it from yytext, which is the string of characters (f)lex has just matched. It is important to make a copy of this string because yytext is part of the lexer's internal structure, and its value will change without warning. Having made a copy of the string, we are then obliged to ensure that the memory is eventually released.
To try this program out:
bison -o splitter.tab.c -d splitter.y
flex -o splitter.lex.c splitter.l
gcc -Wall -O2 -o splitter splitter.tab.c splitter.lex.c
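With the three sample lines from the question saved as input.txt (my filename, for illustration), a run would look like this:

./splitter < input.txt
Map 'one.html' to '/two/'
Map 'one/two/' to '/three'
Map 'three/four' to 'http://five.com'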
I am trying to look for a specific keyword in a multi-line input string like this:
this is input line 1
this is the keyword line
this is another input line
this is the last input line
The multi-line input is stored in a variable called "$inputData". Now, I have two ways in mind to look for the word "keyword".
Method 1:
Using split with a "\n" separator to put the lines into an array, then iterating over and processing each line with a foreach loop, like this:
my @opLines = split("\n", $inputData);
# process each line individually
foreach my $opLine ( @opLines )
{
# look for presence of "keyword" in the line
if(index($opLine, "keyword") > -1)
{
# further processing
}
}
Method 2:
Using regex, as below,
if($inputData =~ /keyword/m)
{
# further processing
}
I would like to know how these 2 methods compare with each other and What would be the better method with regards to actual code performance and execution time. Also, is there a better and more efficient way to go about this task?
my @opLines = split("\n", $inputData);
Will create the variable @opLines, allocate memory, search through the whole of $inputData for "\n", and write the lines found into it.
# process each line individually
foreach my $opLine ( @opLines )
{
Will process the whole block of code below for each value in the array @opLines.
# look for presence of "keyword" in the line
if(index($opLine, "keyword") > -1)
Will search for the "keyword" in each line.
{
# further processing
}
}
And compare:
if($inputData =~ /keyword/m)
Will search for "keyword" and stop when it finds the first occurrence.
{
# further processing
}
And now guess what will be faster and consume less memory (which affects speed as well). If you are bad at guessing, use the Benchmark module.
According to the documentation, the m regular expression modifier treats the string as multiple lines: that is, it changes "^" and "$" from matching only at the left and right ends of the string to matching anywhere within it. I see neither ^ nor $ in your regexp, so the modifier is useless there.
I have a huge string (22,000+ characters) of encoded text. The code consists of digits [0-9] and lowercase letters [a-z]. I need a regular expression to insert a space after every 4 characters, and one to insert a line break [\n] after every forty characters. Any ideas?
Many people would prefer to do this with a for loop and string concatenation, but I hate those substring calls. I am really against using regexes when they aren't the right tool for the job (parsing HTML), but I think it'd be pretty easy to work with one in this case.
JSFiddle Example
Let's say you have the string
var str = "aaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnoooo";
And you want to insert a space after every four characters, and a newline after 40 characters, you could use the following code
str.replace(/.{4}/g, function (value, index) {
    return value + (index % 40 == 36 ? '\n' : ' ');
});
Note that this wouldn't work if the newline interval (40) weren't a multiple of the space interval (4).
I abstracted this in a project, here's a simple way to do it
/**
 * Adds padding and newlines into a string without whitespace
 * @param {string} str The string to be modified (any whitespace will be stripped)
 * @param {number} spaceEvery Number of characters before inserting a space
 * @param {number} wrapEvery Number of spaces before using a newline instead
 * @return {string} The replaced string
 */
function addPadding(str, spaceEvery, wrapEvery) {
var regex = new RegExp(".{"+spaceEvery+"}", "g");
// Add space every {spaceEvery} chars, newline after {wrapEvery} spaces
return str.replace(/[\n\s]/g, '').replace(regex, function(value, index) {
// The index is the group that just finished
var newlineIndex = spaceEvery * (wrapEvery - 1);
return value + ((index % (spaceEvery * wrapEvery) === newlineIndex) ? '\n' : ' ');
});
}
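As a usage note (my example, matching the str defined above): addPadding(str, 4, 10) inserts a space after every four characters and replaces every tenth such space with a newline, reproducing the 4-per-group, 40-per-line layout.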
Well, a regexp in itself doesn't insert a space, so I'll assume you have some command in whatever language you're using that inserts text where a regexp matches.
So, finding 4 characters and finding 40 characters: that's not pretty in general regular expressions (unless your particular implementation has a nice way to express repetition counts). For finding 4 characters, use
....
Because typical regexp finders use maximal munch, and then from the end of one match search forward and maximally munch again, that will chunk your string into 4-character pieces. The ugly part is that in standard regular expressions, you'll have to use
........................................
to find chunks of 40 characters, although I'll note that if you run your 4-character one first, you'll have to run
..................................................
or
.... .... .... .... .... .... .... .... .... ....
to account for the spaces you've already put in.
The period matches any character, but given that you're only using [0-9] and [a-z], you could use [0-9a-z] in place of each period if you need to ensure nothing else slipped in; I was just avoiding making it even more gross.
As you may be noticing, regexps have some limitations. Take a look at the Chomsky hierarchy to really get into their theoretical limitations.