Regex clarification on escape sequences with lex

I'm creating a lexer.l file that is working as intended except for one part. I have the rule:
[\(\*.*\*\)] {}
which I want to make it so that when I encounter (* this is a test *) in a file, I simply do nothing with it. However, when I run lex lexer.l I get warnings on the lines with the rules \(, \*, and \), stating that they can never be matched. So I guess my question is: why would [\(\*.*\*\)] {} interfere with \( and the others? How can I catch (* this is a test *)?

Languages with the comment syntax (*…*) typically allow nested comments, and nested comments cannot easily be recognized by (f)lex because the nesting requires a context-free grammar, and the lexical scanner only implements regular languages.
If your comments do not nest (so that (* something (* else *) is a comment, rather than the prefix of a longer comment), then you can use the regular expression
[(][*][^*]*[*]+([^*)][^*]*[*]+)*[)]
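As for why the rule in the question interferes with the others: square brackets delimit a character class, so [\(\*.*\*\)] matches a single character that is one of (, *, . or ), and the later single-character rules can therefore never be reached. The sketch below illustrates this in Python rather than lex. It assumes only that Python's re module reads these two particular patterns the same way (f)lex does; the helper name find_comments is made up for the example.

```python
import re

# The rule from the question: the brackets make it a character class,
# so it matches exactly one character out of ( * . )
char_class = re.compile(r"[\(\*.*\*\)]")

# The non-nesting comment pattern from the answer, unchanged.
comment = re.compile(r"[(][*][^*]*[*]+([^*)][^*]*[*]+)*[)]")


def find_comments(text):
    """Return every non-nested (* ... *) comment found in text."""
    return [m.group(0) for m in comment.finditer(text)]


if __name__ == "__main__":
    print(char_class.fullmatch("(") is not None)              # single char matches
    print(char_class.fullmatch("(* this is a test *)"))       # whole comment does not
    print(find_comments("a (* this is a test *) b (* x **) c"))
```

Note that the comment pattern accepts (* something (* else *) as one complete comment, which is exactly the non-nesting behaviour described above.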
If you do require nested comments, you can use start conditions and a stack (or a simulated stack, as below):
%x SC_COMMENT
%%
  int comment_nesting = 0;
"(*"             { BEGIN(SC_COMMENT); }
<SC_COMMENT>{
  "(*"           { ++comment_nesting; }
  "*"+")"        { if (comment_nesting) --comment_nesting;
                   else BEGIN(INITIAL); }
  "*"+           ;
  [^(*\n]+       ;
  [(]            ;
  \n             ;
}
That snippet was taken from this answer, with a small adjustment because that answer recognizes nested /*…*/ comments. A fuller explanation of the code appears there.
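For readers who want to experiment with the counting idea outside of flex, here is a minimal Python sketch of the same approach. It is an illustration under assumptions, not a translation of the scanner: the function name strip_nested_comments is hypothetical, and the counter here counts every open comment (so 0 corresponds to the INITIAL state), whereas comment_nesting in the flex snippet counts only the levels beyond the first.

```python
def strip_nested_comments(text):
    """Remove (* ... *) comments, honouring nesting, using a depth counter."""
    out = []
    depth = 0  # 0 means we are outside any comment
    i = 0
    while i < len(text):
        if text.startswith("(*", i):
            # An opener always increases the nesting depth.
            depth += 1
            i += 2
        elif depth > 0 and text.startswith("*)", i):
            # A closer only counts while we are inside a comment.
            depth -= 1
            i += 2
        else:
            # Ordinary character: keep it only when outside comments.
            if depth == 0:
                out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(repr(strip_nested_comments("a (* x (* y *) z *) b")))
```

As in the flex version, an unterminated comment silently swallows the rest of the input; a real scanner would probably want to report that.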

Related

Error while compiling regex function, why am I getting this issue?

My RAKU Code:
sub comments {
    if ($DEBUG) { say "<filtering comments>\n"; }
    my @filteredtitles = ();
    # This loops through each track
    for @tracks -> $title {
        ##########################
        # LAB 1 TASK 2 #
        ##########################
        ## Add regex substitutions to remove superfluous comments and all that follows them
        ## Assign to $_ with smartmatcher (~~)
        ##########################
        $_ = $title;
        if ($_) ~~ s:g:mrx/ .*<?[\(^.*]> / {
            # Repeat for the other symbols
            ########################## End Task 2
            # Add the edited $title to the new array of titles
            @filteredtitles.push: $_;
        }
    }
    # Updates @tracks
    return @filteredtitles;
}
Result when compiling:
Error Compiling! Placeholder variable '@_' may not be used here because the surrounding block doesn't take a signature.
Is there something obvious that I am missing? Any help is appreciated.
So, in contrast with @raiph's answer, here's what I have:
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
Just that. Nothing else. Let's dissect it, from the inside out:
This part: / <[\(^]> / is a regular expression that will match one character, as long as it is an open parenthesis (represented by the \() or a caret (^). Placing them inside the combination of angle brackets and square brackets makes this an enumerated character class.
Then, the S introduces the non-destructive substitution, i.e., a quoting construct that makes regex-based substitutions over the topic variable $_ but does not modify it; it just returns its value with the requested modifications. In the code above, S:g brings the adverb :g or :global (see the global adverb in the adverbs section of the documentation) into play, meaning (in the case of the substitution) "please make as many of these substitutions as possible". The final / marks the end of the replacement text, and since it is adjacent to the second /, the replacement is empty. That means that
S:g / <[\(^]> //
means "please return the contents of $_, but modified in such a way that all its characters matching the regex <[\(^]> are deleted (substituted for the empty string)"
At this point, I should emphasize that regular expressions in Raku are really powerful, and that reading the entire page (and probably the best practices and gotchas page too) is a good idea.
Next, the .map method, documented here, can be applied to any Iterable (List, Array and the like) and returns a sequence built from each element of the Iterable, altered by the Code passed to it. So, something like:
@x.map({ S:g / foo /bar/ })
essentially means "please return a Sequence of every item in @x, modified by substituting any appearance of the substring foo with bar" (nothing in @x itself is altered). A nice place to start to learn about sequences and iterables would be here.
Finally, my one-liner
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
can be translated as:
I have a List with three string elements
Foo
Ba(r
B^az
(This would be a placeholder for your "list of titles"). Take that list and generate a second one, that contains every element on it, but with all instances of the chars "open parenthesis" and "caret" removed.
Ah, and store the result in the variable @tracks (that has my scope).
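For comparison only, the same "map a non-destructive substitution over a list" idea can be sketched in Python. This is an analogy, not Raku: re.sub plays the role of S:g, a list comprehension plays the role of .map, and the function name strip_chars is made up for the example.

```python
import re


def strip_chars(titles):
    """Return new strings with every '(' and '^' removed; the input list is untouched."""
    # [(^] is a character class: '^' is literal here because it is not first.
    return [re.sub(r"[(^]", "", title) for title in titles]


if __name__ == "__main__":
    titles = ["Foo", "Ba(r", "B^az"]
    print(strip_chars(titles))
    print(titles)  # the original list is not modified
```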
Here's what I ended up with:
my @tracks = <Foo Ba(r B^az>;
sub comments {
    my @filteredtitles;
    for @tracks -> $_ is copy {
        s:g / <[\(^]> //;
        @filteredtitles.push: $_;
    }
    return @filteredtitles;
}
The is copy ensures the variable set up by the for loop is mutable.
The s:g/...//; is all that's needed to strip the unwanted characters.
One thing no one can help you with is the error you reported. I currently think you just got confused.
Here's an example of code that generates that error:
do { @_ }
But there is no way the code you've shared could generate that error because it requires that there is an @_ variable in your code, and there isn't one.
One way I can help in relation to future problems you may report on StackOverflow is to encourage you to read and apply the guidance in Minimal Reproducible Example.
While your code did not generate the error you reported, it will perhaps help you if you know about some of the other compile time and run time errors there were in the code you shared.
Compile-time errors:
You wrote s:g:mrx. That's invalid: Adverb mrx not allowed on substitution.
You missed out the third slash of the s///. That causes mayhem (see below).
There were several run-time errors, once I got past the compile-time errors. I'll discuss just one, the regex:
.*<?[...]> will match any sub-string with a final character that's one of the ones listed in the [...], and will then capture that sub-string except without the final character. In the context of an s:g/...// substitution this will strip ordinary characters (captured by the .*) but leave the special characters.
This makes no sense.
So I dropped the .*, and also the ? from the special character pattern, changing it from <?[...]> (which just tries to match against the character, but does not capture it if it succeeds) to just <[...]> (which also tries to match against the character, but, if it succeeds, does capture it as well).
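The difference between testing for a character and consuming it can be tried out with a rough Python analogue. The assumptions here are that Python's (?=...) lookahead stands in for Raku's <?[...]> and that re.sub stands in for s:g///; the two regex engines are not identical, so treat this as a sketch of the idea only.

```python
import re

title = "Ba(r"

# Lookahead version: ".*" consumes the ordinary characters, while the
# special character is only tested, so it survives the substitution.
with_lookahead = re.sub(r".*(?=[(^])", "", title)

# Plain character class: the special character itself is consumed
# and therefore removed.
with_class = re.sub(r"[(^]", "", title)

if __name__ == "__main__":
    print(with_lookahead)
    print(with_class)
```

The first substitution removes the text in front of the parenthesis and keeps the parenthesis, which is the opposite of what the task needs; the second removes just the parenthesis.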
A final comment is about an error you made that may well have seriously confused you.
In a nutshell, the s/// construct must have three slashes.
In your question you had code of the form s/.../ (or s:g/.../ etc), without the final slash. If you try to compile such code the parser gets utterly confused because it will think you're just writing a long replacement string.
For example, if you wrote this code:
if s/foo/ { say 'foo' }
if m/bar/ { say 'bar' }
it'd be as if you'd written:
if s/foo/ { say 'foo' }\nif m/...
which in turn would mean you'd get the compile-time error:
Missing block
------> if m/⏏bar/ { ... }
expecting any of:
block or pointy block
...
because Raku(do) would have interpreted the part between the second and third /s as the replacement double quoted string of what it interpreted as an s/.../.../ construct, leading it to barf when it encountered bar.
So, to recap, the s/// construct requires three slashes, not two.
(I'm ignoring syntactic variants of the construct such as, say, s [...] = '...'.)

A Perl 6 Regex to match a Perl 6 delimited comment

Anyone have a Perl 6 regular expression that will match Perl 6 delimited comments? I would prefer something that's short rather than a full grammar, but I rule out nothing.
As an example of what I am looking for, I want something that can parse the comments in here:
#`{ foo {} bar }
#`« woo woo »
say #`(
This is a (
long )
multiliner()) "You rock!"
#`{{ { And don't forget the tricky repeating delimiters }}
My overall goal is to be able to take a source file and strip the pod and comments and then do interesting things with the code that is left. Stripping line comments and pod is pretty easy, but delimited comments requires additional finesse. I also want this solution to be small and using only Perl 6 core so I can stick it in my dotfiles repo without having external dependencies.
Matching your examples
my %openers-closers = < { } « » ( ) >;   # (many more in reality)
my @openers = %openers-closers.keys;     # { « ( ...
my ($open, $close);                      # possibly multiple chars
my token comment { '#`' <&open> <&middle> <&close> }
my token open {
    # Store first delimiter char: Slurp as many as are repeated:
    ( ( @openers ) $0* )
    # Store the full (possibly multiple character) delimiters:
    { $open = ~$0; $close = %openers-closers{$0[0]} x $0.chars }
}
my token middle {
    :my $nest-level; # for tracking nesting
    [
        # Continue if nested: or if not at unnested end delimiter:
        [ <?{$nest-level}> || <!&close> ]
        # Match either a nested delimiter: or a single character:
        ( $open || $close || . )
        # Keep track of nesting:
        { $_ = ~$0.tail; # set topic to latest match in list
          $nest-level++ when $open; $nest-level-- when $close }
    ]*
}
my token close { $close }
.say for $your-examples ~~ m:g / <.&comment> /
displays:
「{ foo {} bar }」
「« woo woo »」
「(
This is a (
long )
multiliner())」
「{{ { And don't forget the tricky repeating delimiters }}」
Hopefully the code is self-explanatory if you know Raku regexes. Please use the comments if you want clarification of any of it.
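For anyone who does not read Raku regexes fluently, the same strategy (read the repeated opener, derive the matching closer, then scan while counting nesting) can be sketched in Python. This is an illustrative sketch only: PAIRS is a deliberately tiny, hypothetical subset of the bracket pairs, the function names are made up, and none of this is claimed to reproduce Rakudo's exact rules.

```python
# Hypothetical subset of bracket pairs; the real list is much longer.
PAIRS = {"{": "}", "«": "»", "(": ")"}


def match_delimited_comment(src, pos=0):
    """Return the end index of the #`... comment starting at pos, or None."""
    if not src.startswith("#`", pos):
        return None
    i = pos + 2
    if i >= len(src) or src[i] not in PAIRS:
        return None
    opener_char = src[i]
    # Count how many times the opening bracket is repeated.
    n = 0
    while i + n < len(src) and src[i + n] == opener_char:
        n += 1
    opener = opener_char * n
    closer = PAIRS[opener_char] * n
    i += n
    depth = 1
    while i < len(src):
        if src.startswith(opener, i):
            depth += 1
            i += n
        elif src.startswith(closer, i):
            depth -= 1
            i += n
            if depth == 0:
                return i
        else:
            i += 1
    return None  # unterminated comment


def find_comments(src):
    """Collect every delimited comment found in src, in order."""
    found, i = [], 0
    while i < len(src):
        end = match_delimited_comment(src, i)
        if end is None:
            i += 1
        else:
            found.append(src[i:end])
            i = end
    return found
```

Calling find_comments on the text of the four examples from the question returns the four comments, each including its leading #` marker.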
Looking at related Rakudo source code
I wrote the above without referring to Rakudo's source code. (I wanted to see what I came up with without doing so.)
But I've now looked at the source code, which imo would be a more or less mandatory thing to do for anyone trying to do what you're trying to do and serious about understanding how well it might work in the general case.
As a starting point, I was particularly interested in seeing if I could figure out why feeding this code to rakudo (2018.12):
#`{{ {{ And don't forget the tricky repeating delimiters } }}
yields the rather LTA (Less Than Awesome) compiler error:
Starter {{ is immediately followed by a combining codepoint...
This doesn't look directly relevant to your question but I encountered it when trying to understand the nested delimiter rules.
So when I got to this part of my answer I started by searching the Rakudo repo for "immediately followed". That led to a fail-terminator method in the Raku grammar. (Perhaps not of interest to you but it is to me.)
Here's what else I found in the standard grammar that imo is directly related to what you're trying to do, or at least understanding precisely what the code says the rules are about matching comments:
The comment:sym<#`(...)> token that parses these comments. This leads to:
The list of openers. This list should replace the measly 3 opener/closer pairs in my code that just match your examples.
The quibble token. This seems to be a generic "parse 'quoted' (delimited) thing". It leads to:
The babble token. This establishes a "start" and "stop" with this code:
$<B>=[<?before .>]
{
    # Work out the delimiters.
    my $c := $/;
    my @delims := $c.peek_delimiters($c.target, $c.pos);
    my $start := @delims[0];
    my $stop := @delims[1];
The rule peek_delimiters is not in the Raku grammar file.
A search in the Rakudo repo shows it's not anywhere in Rakudo or Raku.
A search in NQP yields a routine in nqp's grammar (from which the Raku grammar inherits, which is why the peek_delimiters call works and why I looked in NQP when I didn't find it in Rakudo/Raku).
I'll stop at this point to draw a conclusion.
Conclusion
You've got a regex. It might work out as you intend. I don't know.
If you end up investigating the above Rakudo/NQP code and understand it well enough to write a walk through of what quibble, babble, nibble, et al do, or discover a good existing write up (I haven't searched for one yet), please add a comment to this answer linking to it. I'll do likewise. TIA!

VIM, Automatic formatting, Code-Guidelines, C++

I want to be able to automatically format code for the following rules using vim:
Rule 1): If expressions which span multiple lines must be indented with 3 spaces. Example:
if(a &&
   b)
(Note: b has three space-indent relative to the parent if, note that current vim behavior is 4)
Rule 2): parameters separated by space. Example:
function_call(a, b, c);
Rule 3): No space between assignment operators. Example:
int a=x;
Rule 4): Reference/dereference operator is attached to variable name not type. Example:
int &x = b;
Where possible, I want vim to do this automatically as I am typing. However, if this is not possible, identifying formatting that is counter to the above rules (by marking it as an error) would also be helpful.
You can set auto-indentation rules in a custom indent file. Check out examples in the "indent" directory, somewhere like /usr/share/vim/vim74/indent, or in the Vim source code distribution.
You can set error highlighting rules in a custom syntax file. Find examples in the "syntax" directory, somewhere like /usr/share/vim/vim74/syntax, or again in the Vim source code distribution. Here's an example for JSON files:
" Syntax: Decimals smaller than one should begin with 0 (so .1 should be 0.1).
syn match jsonNumError "\:\@<=[[:blank:]\r\n]*\zs\.\d\+"
If you want to actually re-format code automatically as you go you might need a special plugin like vim-autoformat and/or an external tool like ClangFormat.
Regarding indenting, and so on, check the options :h 'sw', :h 'cindent', :h 'cinoptions'...
Regarding where spaces and newlines shall be inserted,
For code already typed, clang-format is indeed the best way to go to reformat code. There is a plugin for vim.
For snippets, brackets and so on, lately I've worked on a plugin aimed at formatting text inserted by other plugins. Excessively inspired, I've named the core plugin lh-style. It's used by mu-template (my snippet/templating plugin), and lh-brackets.
For other stuff you'll want to reformat on the fly, it'll be a little bit more complex. Maybe lh-style could help, I don't know; I haven't given much thought to the subject yet.
For instance, outside comments and strings, = shall be expanded into:
itself after a [ (lambdas),
<BS>=<space>, after =, >, <, ! followed by a space
<space>=<space> otherwise
EDIT: I got it all wrong, it does exactly the contrary of what you're looking for.
It'd be something like:
" ftplugin/c/mymappings.vim
function! s:InsertExpr(char) abort
  let col  = col('.')
  let line = getline('.')
  let syn  = synIDattr(synID(line('.'), col-1, 1), 'name')
  " Leave the character untouched inside comments and string literals
  if syn =~? 'comment\|string\|character\|doxygen'
    return a:char
  endif
  let lcut = getline('.')[: col-2]
  let before =
        \ lcut =~ '[=<>!] $' ? "\<bs>"
        \ : lcut =~ "[=<>![ \t\n]$" ? ''
        \ : ' '
  let after = line[col-1] =~ "[ \t\n\\]]" ? '' : ' '
  return before.a:char.after
endfunction
inoremap <buffer> <expr> = <sid>InsertExpr('=')
inoremap <buffer> <expr> < <sid>InsertExpr('<')
inoremap <buffer> <expr> > <sid>InsertExpr('>')

Regex for lexing first and second string (separately) in a pair

I'm trying to write a lexer to parse a file that looks like this:
one.html /two/
one/two/ /three
three/four http://five.com
Each line has two strings separated by a space. I need to create two regex patterns: one to match the first string, and another to match the second string.
This is my attempt at the regex for the lexer (a file named lexer.l to be run by flex):
%%
(\S+)(?:\s+\S+) { printf("FIRST %s\n", yytext); }
(?:\S+\s+)(\S+) { printf("SECOND %s\n", yytext); }
. { printf("Mystery character %s\n", yytext); }
%%
I have tested both (\S+)(?:\s+\S+) and (?:\S+\s+)(\S+) in the Regex101 tester and they both seem to be working properly: https://regex101.com/r/FQTO15/1
However, when I try to build the lexer by running flex lexer.l, I get a warning:
lexer.l:3: warning, rule cannot be matched
This is referring to the second rule I have. If I attempt to reverse the order of the rules, I get the error on the second one yet again. If I only leave in one of the rules, it works perfectly fine.
I believe this issue has to do with the fact that both regexes are similar and of the same length, so flex sees it as ambiguous, even though the two regexes capture different things (but they match the same things?).
Is there anything I can do with the regex so that it will capture/match what I want without clashing with each other?
EDIT: More Test Examples
one.html /two/
one/two.html /three/four/
one /two
one/two/ /three
one_two/ /three
one%20two/ /three
one/two/ /three/four
one/two /three/four/five/
one/two.html http://three.four.com/
one/two/index.html http://three.example.com/four/
one http://two.example.com/three
one/two.pdf https://example.com
one/two?query=string /three/four/
go.example.com https://example.com
EDIT
It turns out that the regex syntax used by flex is rather limited. It has no capturing groups (yytext always holds the whole match), and it does not recognize \s as whitespace.
So this wouldn't work:
^.*\s.*$
But this does:
^.*" ".*$
Thanks to @fossil for all their help.
Although there are ways to solve your problem as stated, I think you would be better off understanding the intended use of (f)lex, and to find a solution consistent with its processing model.
(F)lex is intended to split an input into individual tokens. Each token has a type, and it is expected that it is possible to figure out the type of a token simply by looking at it (and not at its context). The classic model of a token type are the objects in a computer program, where we have, for example, identifiers, numbers, certain keywords, and various operators. Given an appropriate set of rules, a (f)lex scanner will take an input like
a = b*7 + 2;
and produce a stream of tokens:
identifier = identifier * number + number ;
Each of these tokens has an associated "semantic value" (which not all of them actually require), so that the two identifier tokens and the two number tokens are not just anonymous blobs.
Note that a and b in the above line have different roles. a is being assigned to, while b is being referred to. But that's not relevant to their form, and it is not evident from their form. They are just tokens. Figuring out what they mean and their relationship with each other is the role of a parser, which is a separate part of the parsing model. The intention of the two-phase scan/parse paradigm is to simplify both tasks by abstracting away complications: the scanner knows nothing about context or meaning, while the parser can deduce the logical structure of the input without concerning itself with the messy details of representation and irrelevant whitespace.
In many ways, your problem is a bit outside of this paradigm, in part because the two token types you have cannot be distinguished on the basis of their appearance alone. If they have no useful internal structure, though, then you could just accept that your input consists of
"paths", which do not contain whitespace, and
newline characters.
You could then use a combination of a lexer and a parser to break the input into lines:
File splitter.l
%{
#include "splitter.tab.h"
%}
%option noinput nounput noyywrap nodefault
%%
\n { return '\n'; }
[^[:space:]]+ { yylval = strdup(yytext); return PATH; }
[[:space:]] /* Ignore whitespace other than newlines */
File splitter.y
%code {
  #include <stdio.h>
  #include <stdlib.h>
  int yylex();
  void yyerror(const char* msg);
}
%code requires {
  #define YYSTYPE char*
}
%token PATH
%%
lines: %empty
     | lines line '\n'
line : %empty
     | PATH PATH { printf("Map '%s' to '%s'\n", $1, $2);
                   free($1); free($2);
                 }
%%
void yyerror(const char* msg) {
  fprintf(stderr, "%s\n", msg);
}
int main(int argc, char** argv) {
  return yyparse();
}
Quite a lot of the above is boiler-plate; it's worth concentrating just on the grammar and the token patterns.
The grammar is very simple:
lines: %empty
     | lines line '\n'
line : %empty
     | PATH PATH { printf("Map '%s' to '%s'\n", $1, $2);
                   free($1); free($2);
                 }
The interesting line is the last one, which says that a line consists of two PATHs. That handles each line by printing it out, although you'd probably want to do something different. It is this line which understands that the first word on a line and the second word on the same line have different functions. Note that it doesn't need the lexer to label the two words as "FIRST" and "SECOND", since it can see that all by itself :)
The two calls to free release the memory allocated by strdup in the lexer, thus avoiding a memory leak. In a real application, you'd need to make sure you don't free the strings until you don't need them any more.
The lexer patterns are also very simple:
\n { return '\n'; }
[^[:space:]]+ { yylval = strdup(yytext); return PATH; }
[[:space:]] /* Ignore whitespace other than newlines */
The first one returns a special single-character token, a newline character, to stand for the end-of-line token. The second one matches any string of non-whitespace characters. ((F)lex doesn't know about GNU regex extensions, so it doesn't have \s and friends. It does, however, have the much more readable Posix character classes, which are listed in the flex manual, among other places.) The third pattern skips any whitespace. Since \n was already handled by the first pattern, it cannot be matched here (which is why this pattern is a single whitespace character and not a repetition).
In the second pattern, we assign a value to yylval, which is the semantic value of the token. (We don't do this elsewhere because the newline token doesn't need a semantic value.) yylval always has type YYSTYPE, which we have arranged to be char* by a #define. Here, we just set it from yytext, which is the string of characters (f)lex has just matched. It is important to make a copy of this string because yytext is part of the lexer's internal structure, and its value will change without warning. Having made a copy of the string, we are then obliged to ensure that the memory is eventually released.
To try this program out:
bison -o splitter.tab.c -d splitter.y
flex -o splitter.lex.c splitter.l
gcc -Wall -O2 -o splitter splitter.tab.c splitter.lex.c
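If you just want to get a feel for the scan/parse split before setting up flex and bison, the same two-phase idea can be sketched in a few lines of Python. This is only an analogy under assumptions: the names TOKEN, tokenize and parse_lines are made up for the example, and the error handling is much cruder than what bison gives you.

```python
import re

# Scanner: a PATH is a run of non-whitespace, NEWLINE is "\n",
# and any other whitespace is skipped. Tokens carry no context.
TOKEN = re.compile(r"(?P<NEWLINE>\n)|(?P<PATH>\S+)|(?P<SKIP>[^\S\n]+)")


def tokenize(text):
    """Yield (kind, value) pairs, dropping the SKIP tokens."""
    for m in TOKEN.finditer(text):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()


def parse_lines(text):
    """Parser: each line is either empty or exactly two PATH tokens."""
    pairs, current = [], []
    for kind, value in tokenize(text):
        if kind == "PATH":
            current.append(value)
            continue
        # kind == "NEWLINE": the line is complete, so check its shape.
        if len(current) == 2:
            pairs.append((current[0], current[1]))
        elif current:
            raise ValueError("expected two paths, got %d" % len(current))
        current = []
    if current:
        # Like the grammar above, this sketch requires a final newline.
        raise ValueError("input must end with a newline")
    return pairs


if __name__ == "__main__":
    sample = "one.html /two/\none/two/ /three\nthree/four http://five.com\n"
    for first, second in parse_lines(sample):
        print("Map '%s' to '%s'" % (first, second))
```

As in the bison version, it is the parser, not the scanner, that knows which path comes first and which comes second.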

Setting up rules for Flex, warning:"rule cannot be matched"

I have these flex rules:
^User-Agent: [^\n]*Firefox {useragent = TFIREFOX; }
^User-Agent: [^\n]*MSIE {useragent = TMSIE; }
^User-Agent: [^\n]*Opera {useragent = TOPERA; }
^User-Agent: [^\n]*Safari {useragent = TSAFARI; }
...
I get the warning "rule cannot be matched" on every line after the first rule. I expect the first rule to match only lines with "Firefox" in them, but I think I'm wrong. How can I fix these rules? I read the flex manpage and I'm still helpless.
I believe the issue here is that flex treats unescaped whitespace as the end of the pattern. So when it parses your file, it takes only ^User-Agent: as the pattern and treats everything after the space as part of the action. All of your rules therefore have the same pattern, and only the first one can ever match. You can make this work by escaping the space:
^User-Agent:\ [^\n]*Firefox
^User-Agent:\ [^\n]*MSIE
^User-Agent:\ [^\n]*Opera
^User-Agent:\ [^\n]*Safari
I tested this with flex 2.5.35, and it does what you want.