Regex for lexing first and second string (separately) in a pair

I'm trying to write a lexer to parse a file that looks like this:
one.html /two/
one/two/ /three
three/four http://five.com
Each line has two strings separated by a space. I need to create two regex patterns: one to match the first string, and another to match the second string.
This is my attempt at the regex for the lexer (a file named lexer.l to be run by flex):
%%
(\S+)(?:\s+\S+) { printf("FIRST %s\n", yytext); }
(?:\S+\s+)(\S+) { printf("SECOND %s\n", yytext); }
. { printf("Mystery character %s\n", yytext); }
%%
I have tested both (\S+)(?:\s+\S+) and (?:\S+\s+)(\S+) in the Regex101 tester and they both seem to be working properly: https://regex101.com/r/FQTO15/1
However, when I try to build the lexer by running flex lexer.l, I get an error:
lexer.l:3: warning, rule cannot be matched
This is referring to the second rule I have. If I attempt to reverse the order of the rules, I get the error on the second one yet again. If I only leave in one of the rules, it works perfectly fine.
I believe this issue has to do with the fact that both regexes are similar and of the same length, so flex sees it as ambiguous, even though the two regexes capture different things (but they match the same things?).
Is there anything I can do with the regex so that it will capture/match what I want without clashing with each other?
EDIT: More Test Examples
one.html /two/
one/two.html /three/four/
one /two
one/two/ /three
one_two/ /three
one%20two/ /three
one/two/ /three/four
one/two /three/four/five/
one/two.html http://three.four.com/
one/two/index.html http://three.example.com/four/
one http://two.example.com/three
one/two.pdf https://example.com
one/two?query=string /three/four/
go.example.com https://example.com
EDIT
It turns out that the regex engine used by flex is rather limited. It doesn't support capture groups, and it doesn't recognize \s for whitespace.
So this wouldn't work:
^.*\s.*$
But this does:
^.*" ".*$
Thanks to @fossil for all their help.

Although there are ways to solve your problem as stated, I think you would be better off understanding the intended use of (f)lex, and finding a solution consistent with its processing model.
(F)lex is intended to split an input into individual tokens. Each token has a type, and it is expected that it is possible to figure out the type of a token simply by looking at it (and not at its context). The classic model of token types comes from computer programs, where we have, for example, identifiers, numbers, certain keywords, and various operators. Given an appropriate set of rules, a (f)lex scanner will take an input like
a = b*7 + 2;
and produce a stream of tokens:
identifier = identifier * number + number ;
Each of these tokens has an associated "semantic value" (which not all of them actually require), so that the two identifier tokens and the two number tokens are not just anonymous blobs.
Note that a and b in the above line have different roles. a is being assigned to, while b is being referred to. But that's not relevant to their form, and it is not evident from their form. They are just tokens. Figuring out what they mean and their relationship with each other is the role of a parser, which is a separate part of the parsing model. The intention of the two-phase scan/parse paradigm is to simplify both tasks by abstracting away complications: the scanner knows nothing about context or meaning, while the parser can deduce the logical structure of the input without concerning itself with the messy details of representation and irrelevant whitespace.
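For concreteness, here is a minimal sketch of a flex rule set that would produce that token stream (the patterns are illustrative, not tuned for any real language):
%option noyywrap
%%
[[:alpha:]_][[:alnum:]_]*   { printf("identifier "); }
[[:digit:]]+                { printf("number "); }
[-+*/=;]                    { printf("%s ", yytext); /* operators stand for themselves */ }
[[:space:]]+                { /* skip whitespace */ }
%%
int main(void) { return yylex(); }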
In many ways, your problem is a bit outside of this paradigm, in part because the two token types you have cannot be distinguished on the basis of their appearance alone. If they have no useful internal structure, though, then you could just accept that your input consists of
"paths", which do not contain whitespace, and
newline characters.
You could then use a combination of a lexer and a parser to break the input into lines:
File splitter.l
%{
#include <string.h>
#include "splitter.tab.h"
%}
%option noinput nounput noyywrap nodefault
%%
\n { return '\n'; }
[^[:space:]]+ { yylval = strdup(yytext); return PATH; }
[[:space:]] /* Ignore whitespace other than newlines */
File splitter.y
%code {
#include <stdio.h>
#include <stdlib.h>
int yylex();
void yyerror(const char* msg);
}
%code requires {
#define YYSTYPE char*
}
%token PATH
%%
lines: %empty
     | lines line '\n'
     ;
line : %empty
     | PATH PATH { printf("Map '%s' to '%s'\n", $1, $2);
                   free($1); free($2);
                 }
     ;
%%
void yyerror(const char* msg) {
fprintf(stderr, "%s\n", msg);
}
int main(int argc, char** argv) {
return yyparse();
}
Quite a lot of the above is boiler-plate; it's worth concentrating just on the grammar and the token patterns.
The grammar is very simple:
lines: %empty
     | lines line '\n'
     ;
line : %empty
     | PATH PATH { printf("Map '%s' to '%s'\n", $1, $2);
                   free($1); free($2);
                 }
     ;
The interesting line is the last one, which says that a line consists of two PATHs. That handles each line by printing it out, although you'd probably want to do something different. It is this line which understands that the first word on a line and the second word on the same line have different functions. Note that it doesn't need the lexer to label the two words as "FIRST" and "SECOND", since it can see that all by itself :)
The two calls to free release the memory allocated by strdup in the lexer, thus avoiding a memory leak. In a real application, you'd need to make sure you don't free the strings until you don't need them any more.
The lexer patterns are also very simple:
\n { return '\n'; }
[^[:space:]]+ { yylval = strdup(yytext); return PATH; }
[[:space:]] /* Ignore whitespace other than newlines */
The first one returns a special single-character token, a newline character, for the end-of-line token. The second one matches any string of non-whitespace characters. ((F)lex doesn't know about GNU regex extensions, so it doesn't have \s and friends. It does, however, have the much more readable Posix character classes, which are listed in the flex manual, among other places.) The third pattern skips any whitespace. Since \n was already handled by the first pattern, it cannot be matched here (which is why this pattern is a single whitespace character and not a repetition).
In the second pattern, we assign a value to yylval, which is the semantic value of the token. (We don't do this elsewhere because the newline token doesn't need a semantic value.) yylval always has type YYSTYPE, which we have arranged to be char* by a #define. Here, we just set it from yytext, which is the string of characters (f)lex has just matched. It is important to make a copy of this string because yytext is part of the lexer's internal structure, and its value will change without warning. Having made a copy of the string, we are then obliged to ensure that the memory is eventually released.
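To make the pitfall concrete:
yylval = yytext;          /* WRONG: yytext is the scanner's internal buffer */
yylval = strdup(yytext);  /* right: take a private copy, and free it later */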
To try this program out:
bison -o splitter.tab.c -d splitter.y
flex -o splitter.lex.c splitter.l
gcc -Wall -O2 -o splitter splitter.tab.c splitter.lex.c
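Running it on the first two sample lines from the question should then print something like:
$ printf 'one.html /two/\none/two/ /three\n' | ./splitter
Map 'one.html' to '/two/'
Map 'one/two/' to '/three'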

Related

Error while compiling regex function, why am I getting this issue?

My Raku code:
sub comments {
    if ($DEBUG) { say "<filtering comments>\n"; }
    my @filteredtitles = ();
    # This loops through each track
    for @tracks -> $title {
        ##########################
        #      LAB 1 TASK 2      #
        ##########################
        ## Add regex substitutions to remove superfluous comments and all that follows them
        ## Assign to $_ with smartmatcher (~~)
        ##########################
        $_ = $title;
        if ($_) ~~ s:g:mrx/ .*<?[\(^.*]> / {
            # Repeat for the other symbols
            ########################## End Task 2
            # Add the edited $title to the new array of titles
            @filteredtitles.push: $_;
        }
    }
    # Updates @tracks
    return @filteredtitles;
}
Result when compiling:
Error Compiling! Placeholder variable '@_' may not be used here because the surrounding block doesn't take a signature.
Is there something obvious that I am missing? Any help is appreciated.
So, in contrast with @raiph's answer, here's what I have:
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
Just that. Nothing else. Let's dissect it, from the inside out:
This part: / <[\(^]> / is a regular expression that will match one character, as long as it is an open parenthesis (represented by the \() or a caret (^). When these go inside the angle-bracket/square-bracket combo <[ ]>, that is an enumerated character class.
Then the S introduces the non-destructive substitution, i.e., a quoting construct that will make regex-based substitutions over the topic variable $_ but will not modify it, just return its value with the modifications requested. In the code above, S:g brings the adverb :g or :global into play (see the global adverb in the adverbs section of the documentation), meaning (in the case of the substitution) "please make as many as possible of this substitution". The final / marks the end of the substitution text, and as it is adjacent to the second /, that means that
S:g / <[\(^]> //
means "please return the contents of $_, but modified in such a way that all its characters matching the regex <[\(^]> are deleted (substituted for the empty string)"
At this point, I should emphasize that regular expressions in Raku are really powerful, and that reading the entire page (and probably the best practices and gotchas page too) is a good idea.
Next, the .map method, documented here, will be applied to any Iterable (List, Array and all their alikes) and will return a sequence based on each element of the Iterable, altered by a Code passed to it. So, something like:
@x.map({ S:g / foo /bar/ })
essentially means "please return a Sequence of every item on @x, modified by substituting any appearance of the substring foo with bar" (nothing will be altered on @x). A nice place to start to learn about sequences and iterables would be here.
Finally, my one-liner
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
can be translated as:
I have a List with three string elements
Foo
Ba(r
B^az
(This would be a placeholder for your "list of titles"). Take that list and generate a second one, that contains every element on it, but with all instances of the chars "open parenthesis" and "caret" removed.
Ah, and store the result in the variable @tracks (which has my scope).
Here's what I ended up with:
my @tracks = <Foo Ba(r B^az>;

sub comments {
    my @filteredtitles;
    for @tracks -> $_ is copy {
        s:g / <[\(^]> //;
        @filteredtitles.push: $_;
    }
    return @filteredtitles;
}
The is copy ensures the variable set up by the for loop is mutable.
The s:g/...//; is all that's needed to strip the unwanted characters.
One thing no one can help you with is the error you reported. I currently think you just got confused.
Here's an example of code that generates that error:
do { @_ }
But there is no way the code you've shared could generate that error, because it requires that there is an @_ variable in your code, and there isn't one.
One way I can help in relation to future problems you may report on StackOverflow is to encourage you to read and apply the guidance in Minimal Reproducible Example.
While your code did not generate the error you reported, it will perhaps help you to know about some of the other compile-time and run-time errors that were in the code you shared.
Compile-time errors:
You wrote s:g:mrx. That's invalid: Adverb mrx not allowed on substitution.
You missed out the third slash of the s///. That causes mayhem (see below).
There were several run-time errors, once I got past the compile-time errors. I'll discuss just one, the regex:
.*<?[...]> will match any sub-string with a final character that's one of the ones listed in the [...], and will then capture that sub-string except without the final character. In the context of an s:g/...// substitution this will strip ordinary characters (captured by the .*) but leave the special characters.
This makes no sense.
So I dropped the .*, and also the ? from the special character pattern, changing it from <?[...]> (which just tries to match against the character, but does not capture it if it succeeds) to just <[...]> (which also tries to match against the character, but, if it succeeds, does capture it as well).
A final comment is about an error you made that may well have seriously confused you.
In a nutshell, the s/// construct must have three slashes.
In your question you had code of the form s/.../ (or s:g/.../ etc), without the final slash. If you try to compile such code the parser gets utterly confused because it will think you're just writing a long replacement string.
For example, if you wrote this code:
if s/foo/ { say 'foo' }
if m/bar/ { say 'bar' }
it'd be as if you'd written:
if s/foo/ { say 'foo' }\nif m/...
which in turn would mean you'd get the compile-time error:
Missing block
------> if m/⏏bar/ { ... }
expecting any of:
block or pointy block
...
because Raku(do) would have interpreted the part between the second and third /s as the replacement double quoted string of what it interpreted as an s/.../.../ construct, leading it to barf when it encountered bar.
So, to recap, the s/// construct requires three slashes, not two.
(I'm ignoring syntactic variants of the construct such as, say, s [...] = '...'.)

Multi-line comment in flex/lex

I'm trying to match a multi-line comment with these conditions:
Starts with ''' and finishes with '''
Can't contain exactly three ''' inside, example:
'''''' Correct
'''''''' Correct
'''a'a''a''''a''' Correct
''''''''' Incorrect
'''a'''a''' Incorrect
This is my approximation, but I'm not able to work out the correct expression for this:
'''([^']|'[^']|''[^']|''''+[^'])*'''+
The easy solution is to use a start condition. (Note that this doesn't pass on all your test cases, because I think the problem description is ambiguous. See below.)
In the following, I assume that you want to return the matched token, and that you are using a yacc/bison-generated parser which includes char* str as one of the union types. The start-condition block is a Flex extension; in the unlikely event that you're using some other lex derivative, you'll need to write out the patterns one per line, each one with the <SC_TRIPLE_QUOTE> prefix (and no space between that and the pattern).
%x SC_TRIPLE_QUOTE
%%
'''             { BEGIN(SC_TRIPLE_QUOTE); }
<SC_TRIPLE_QUOTE>{
  '''           { yylval.str = strndup(yytext, yyleng - 3);
                  BEGIN(INITIAL);
                  return STRING_LITERAL;
                }
  [^']+       |
  ''?/[^']    |
  ''''+         { yymore(); }
  <<EOF>>       { yyerror("Unterminated triple-quoted string");
                  return 0;
                }
}
I hope that's more or less self-explanatory. The first four patterns inside the start condition match the ''' terminator, any sequence of characters other than ', no more than two ', or at least four '; yymore() causes the successive matches to be accumulated. The call to strndup excludes the closing delimiter from the saved value.
Note:
The above code won't provide what you expect from the second example, because I don't think it is possible (or, alternatively, you need to be clearer about which of the two possible analyses is correct, and why). Consider the possible comments:
'''a'''
'''a''''
'''a''''a'''
According to your description (and your third example), the third one should match, with the internal value a''''a, because '''' is more than three quotes. But according to your second example (slightly modified), the second one should match, with the internal value a', because the final ''' is taken as a terminator. The question is, how are these two possible interpretations supposed to be distinguished? In other words, what clue does the lexical scanner have that in the second case the token ends at ''', while in the third one it doesn't? Since both of these are part of an input stream, there could be arbitrary text following. And since these are supposed to be multi-line comments, there's no a priori reason to believe that the newline character isn't part of the token.
So I made an arbitrary choice about which interpretation to choose. I could have made the other arbitrary choice, but then a different example wouldn't work.

Creating a rule for a printable character in lex/yacc

I would like to create a grammar rule for a printable character (any character for which the C isprint() function returns true).
For this purpose I created the following regex rule inside my lex file:
[\x20-\x7E] { yylval.ch = strdup(yytext); return CHARACTER; }
The regular expression contains all the printable characters based on their ASCII hexadecimal value.
On my first attempt this rule was located at the bottom, but any printable character already matched by an earlier rule obviously wasn't included. For example, if my input was the character '+' and I had a previous rule:
"+" { return PLUS_OPERATOR; }
The parser accepted it as a PLUS_OPERATOR and not as CHARACTER.
Then I tried to place the character rule at the top of my scanner, and for the same reason as before, all the following rules with characters in the printable range could not be matched.
My question is: what can I do to create a rule that will match all printable characters, while also keeping rules for specific characters?
The only thing that I can think of is to put it at the bottom and use a grammar rule combining all the one-character regular expression rules with the character rule (e.g. CHAR : PLUS_OPERATOR | MINUS_OPERATOR | EQUAL_OPERATOR | CHARACTER).
I have a lot more than 3 one-character rules in my lex file, so obviously I'm looking for a more elegant solution.
The only solution is the one you propose: create a non-terminal which is the union of all the relevant terminals.
Personally, I find grammars much more readable if single-character tokens are written as themselves, so I would write:
printable: '+' | '-' | '=' | CHAR
in the bison file, and in the scanner:
[-+=] { yylval.ch = yytext[0]; return yylval.ch; }
[[:print:]] { yylval.ch = yytext[0]; return CHAR; }
(which in turn requires the semantic type to be a union with both char and char* fields; the advantage is that you don't need to worry about freeing the strings created for operator characters.)
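For reference, the corresponding declarations in the bison file might look like this (a sketch; CHAR and the field names are just the ones used above):
%union {
    char  ch;   /* single characters: operators and CHAR */
    char* str;  /* tokens carrying a string value */
}
%token <ch> CHAR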
That is about as elegant as it gets, I'm afraid.

How to handle nested comments in flex

I am working on writing a flex scanner for a language supporting nested comments like this:
/*
/**/
*/
I used to work with OCaml/ocamllex, where the lexer can call itself recursively, which is very elegant. But I am now switching to C++/flex; how do I handle such nested comments?
Assuming that only comments can be nested in comments, a stack is a very expensive solution for what could be achieved with a simple counter. For example:
%x SC_COMMENT
%%

    int comment_nesting = 0;  /* Line 4 */

"/*"           { BEGIN(SC_COMMENT); }
<SC_COMMENT>{
  "/*"         { ++comment_nesting; }
  "*"+"/"      { if (comment_nesting) --comment_nesting;
                 else BEGIN(INITIAL); }
  "*"+         ; /* Line 11 */
  [^/*\n]+     ; /* Line 12 */
  [/]          ; /* Line 13 */
  \n           ; /* Line 14 */
}
Some explanations:
Line 4: Indented lines before the first rule are inserted at the top of the yylex function where they can be used to declare and initialize local variables. We use this to initialize the comment nesting depth to 0 on every call to yylex. The invariant which must be maintained is that comment_nesting is always 0 in the INITIAL state.
Lines 11-13: A simpler solution would have been the single pattern .|\n, but that would result in every comment character being treated as a separate subtoken. Even though the corresponding action does nothing, this would have caused the scan loop to be broken and the action switch statement to be executed for every character. So it is usually better to try to match several characters at once.
We need to be careful about / and * characters, though; we can only ignore those asterisks which we are certain are not part of the */ which terminates the (possibly nested) comment. Hence lines 11 and 12. (Line 12 won't match a sequence of asterisks which is followed by a / because those will already have been matched by the pattern above, at line 9.) And we need to ignore / if it is not followed by a *. Hence line 13.
Line 14: However, it can also be sub-optimal to match too large a token.
First, flex is not optimized for large tokens, and comments can be very large. If flex needs to refill its buffer in the middle of a token, it will retain the open token in the new buffer, and then rescan from the beginning of the token.
Second, flex scanners can automatically track the current line number, and they do so relatively efficiently. The scanner checks for newlines only in tokens matched by patterns which could possibly match a newline. But the entire match needs to be scanned.
We reduce the impact of both of these issues by matching newline characters inside comments as individual tokens. (Line 14, also see line 12) This limits the yylineno scan to a single character, and it also limits the expected length of internal comment tokens. The comment itself might be very large, but each line is likely to be limited to a reasonable length, thus avoiding the potentially quadratic rescan on buffer refill.
I resolved this problem by using yy_push_state, yy_pop_state and a start condition, like this:
/* yy_push_state/yy_pop_state require the start-condition stack */
%option stack
%x comment
%%
"/*"       { yy_push_state(comment); }
<comment>{
  "*/"     { yy_pop_state(); }
  "/*"     { yy_push_state(comment); }
}
%%
In this way, I can handle any level of nested comment.
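Note that yy_push_state and yy_pop_state are only generated when the start-condition stack is enabled with %option stack, as in the snippet above. Also, as written, the rules rely on flex's default action, which echoes the comment text to the output; to silently discard the contents instead, rules along these lines (a sketch, mirroring the counter-based answer above) could be added inside the start condition:
<comment>{
  [^/*\n]+   ;  /* discard comment text */
  [/*]       ;  /* discard '/' and '*' not matched by the rules above */
  \n         ;  /* discard newlines */
}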

ANTLR4 Lexer Matching Start of Line End Of Line

How do I achieve Perl regular expression ^ and $ in the ANTLR4 lexer? I.e., how do I match the start of a line and the end of a line without consuming any character?
I am trying to use the ANTLR4 lexer to match a # character at the start of a line but not in the middle of a line. For example, I want to isolate and toss out all C++ preprocessor directives regardless of which directive it is, while disregarding a # inside a string literal. (Normally we can tokenize C++ string literals to eliminate a # appearing in the middle of a line, but assume we're not doing that.) That means I only want to specify # .*? without bothering with #if, #ifndef, #pragma, etc.
Also, the C++ standard allows whitespace and multi-line comments right before and after the #, e.g.
/* helo
world*/ # /* hel
l
o
*/ /*world */ifdef .....
is considered a valid preprocessor directive appearing on a single line (the CRLFs inside the ML_COMMENTs are tossed).
This is what I am doing currently:
PPLINE: '\r'? '\n' (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+ -> channel(PPDIR);
But the problem is that I have to rely on the existence of a CRLF before the #, and I toss out that CRLF along with the directive. I then need the CRLF terminating the directive line to stand in for the one that was tossed, so I have to make sure the directive is terminated by a CRLF.
However, that means my grammar cannot handle a directive appearing right at the start of the file (i.e. with no preceding CRLF) or ending at EOF without a terminating CRLF.
If Perl-style ^ and $ syntax were available, I could match SOL/EOL instead of explicitly matching and consuming CRLF.
You can use semantic predicates for the conditions.
PPLINE
    : {getCharPositionInLine() == 0}?
      (ML_COMMENT | '\t' | '\f' | ' ')* '#' (ML_COMMENT | ~[\r\n])+
      {_input.LA(1) == '\r' || _input.LA(1) == '\n'}?
      -> channel(PPDIR)
    ;
You could try having multiple rules with gated semantics (different lexer rules in different states) or with modes (pushMode -> http://www.antlr.org/wiki/display/ANTLR4/Lexer+Rules), having an alternative rule for the beginning of the file and then switching to the core rules when the directives end, but it could be a long job.
First, perhaps, I would check whether there really are problems parsing #pragma/preprocessor directives without changing anything, because if the difficulty with finding a # is that it could also appear in strings and comments, then just by ordering the rules you should be able to direct each case to the right rule (though this could be a problem for languages where you can put directives in comments).