Raku: Mutually recursive tokens cause a "method not found" error

I've simplified a more complicated pattern I'm trying to match down to the following program:
my token paren { '(' <tok> ')' }
my token tok { <paren>? foo }
say "(foo)foo" ~~ /<tok>/;
This seems straightforward enough to me, but I get this error:
No such method 'tok' for invocant of type 'Match'. Did you mean 'to'?
in regex paren at a.p6 line 1
in regex tok at a.p6 line 2
in block <unit> at a.p6 line 4
What's the reason for this error?
The pattern matches without error if I change the first <tok> to <&tok>, but then I don't get a capture of that named pattern, which, in my original, more-complicated case, I need.

The problem is that tok isn't in the current lexical namespace yet, so <tok> gets compiled as a method call instead.
If you force it to be a lexical call with &, it works.
my token paren { '(' <&tok> ')' }
my token tok { <paren>? foo }
say "(foo)foo" ~~ /<tok>/;
If <…> starts with anything other than a letter, it doesn't capture.
So to capture it under the name tok, we add tok= inside the <…>:
my token paren { '(' <tok=&tok> ')' }
my token tok { <paren>? foo }
say "(foo)foo" ~~ /<tok>/;

Brad's answer is correct, but I wanted to offer another possible solution: you can recurse within a single regex using the special <~~> token. Using that, plus a regular named capture, would create the captures you want in the simplified example in your question; here's how that would look:
my token paren { $<tok> = ['(' <~~> ')']? foo }
I'm not sure whether your more complicated case can be as easily rewritten to recurse on itself rather than using two mutually recursive tokens. But when this pattern works, it can simplify the code quite a bit.


Comment pattern match in flex using states

I am trying to match single line comment pattern in flex. Patterns of the comment could be:
//this is a single /(some random stuff) line comment
Or it could be like this:
// this is also a comment\
continuation of the comment from previous line
From the example it's obvious that I have to handle the multi-line case too.
Now my approach was using states. This is what I have so far:
"//" {
yymore();
BEGIN (SINGLE_COMMENT);
}
<SINGLE_COMMENT>([^{NEWLINE}]|\\[(.){NEWLINE}]) {
yymore();
}
<SINGLE_COMMENT>([^{NEWLINE}]|[^\\]{NEWLINE}) {
logout << "Line no " << line_count << ": TOKEN <COMMENT> Lexeme " << string(yytext) << "\nfound\n\n";
BEGIN (INITIAL);
}
NEWLINE is declared as:
NEWLINE \r?\n
My declaration unit:
%option noyywrap
%x SINGLE_COMMENT
int line_count = 1;
const int bucketSize = 10; // change if necessary
ofstream logout;
ofstream tokenout;
SymbolTable symbolTable(bucketSize);
Action of NEWLINE:
{NEWLINE} {
line_count++;
}
If I run it with the following input:
// hello\
int main
This is my log file:
Line no 1: TOKEN <COMMENT> Lexeme // hello\
found
Line no 1: TOKEN <INT> Lexeme int found
Line no 1: TOKEN <ID> Lexeme main found
ScopeTable # 1
6 --> < main , ID >
So it's not catching the multi-line comment. Also, line_count is not incremented; it stays the same. Can anybody help me figure out what I have done wrong?
In (f)lex, as in most regular expression engines, [ and ] enclose a character class description. A character class is a set of individual characters, and it always matches exactly one character which is a member of that set. There are also negated character classes which are written the same way except that they start with [^ and match exactly one character which is not a member of the set.
Character classes are not the same as sequences of characters:
ab matches an a followed by a b
[ab] matches either an a or a b
Since character classes are just sets of characters, it is meaningless for the individual characters in the class to be repeated or optional, etc. Consequently, almost no regular expression operators (*, +, ?, etc.) are meaningful inside a character class. If you put one of them in a character class expression, it is handled just like an ordinary character:
a* matches 0 or more as
[a*] matches either an a or a *
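The four patterns above can be checked outside of flex. The following is only a sketch: it uses C++ std::regex (ECMAScript syntax), which is not flex but treats sequences and character classes the same way for these cases. The helper names full_match and check_sequences_vs_classes are made up for the illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if the whole of `text` matches `pattern` (ECMAScript syntax, the std::regex default).
bool full_match(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// Exercises the four patterns discussed above.
void check_sequences_vs_classes() {
    assert(full_match("ab", "ab"));     // sequence: an a followed by a b
    assert(!full_match("ab", "a"));     // a alone is not enough
    assert(full_match("[ab]", "a"));    // class: exactly one character, a or b
    assert(full_match("[ab]", "b"));
    assert(!full_match("[ab]", "ab"));  // two characters, so the class cannot match
    assert(full_match("a*", "aaa"));    // zero or more a's
    assert(full_match("a*", ""));
    assert(full_match("[a*]", "*"));    // inside a class, * is an ordinary character
    assert(!full_match("[a*]", "aa"));  // still exactly one character
}
```

Calling check_sequences_vs_classes() from main should complete without tripping any assertion.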
One of the features flex provides which is not provided by most other regular expression systems is macro expansions, of the form {name}. Here the { and } indicate the expansion of a defined macro, whose name is contained between the braces. These characters are also not special inside a character class:
{identifier} matches whatever the expanded macro named identifier would match.
[{identifier}] matches a single character which is {, } or one of the letters definrt
Macro definitions seem to be overused by beginners. My advice is always to avoid them, and thereby avoid the confusion which they create.
It's also worth noting that (f)lex does not have an operator which negates a subpattern. Only character classes can be negated; there is no easy way to write "match anything other than foo". However, you can generally rely on the first longest-match rule to effectively implement negations: if some pattern p executes, then there cannot be any pattern which would match more than p. Thus, it might not be necessary to explicitly write the negation.
For example, in your comment detector where the only real issue is dealing with carriage return (\r) characters which are not followed by newline characters, you could use (f)lex's pattern matching algorithm to your advantage:
<SINGLE_COMMENT>{
[^\\\r\n]+ ;
\\\r?\n { ++line_count; }
\\. ; /* only matches if the above rule doesn't */
\r?\n { ++line_count; BEGIN(INITIAL); }
\r ; /* only matches if the above rule doesn't */
}
By the way, it's usually much easier to provide %option yylineno than to try to track newlines manually.
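To make the logic of those rules easier to follow, here is a hand-written C++ sketch of the same scan. It is not what flex generates, and the names CommentScan and scan_line_comment are invented for this example; it only mirrors the five rules in the order of precedence described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Result of scanning one "//" comment.
struct CommentScan {
    std::size_t end;      // index just past the comment, including its final newline if any
    int lines_consumed;   // newlines seen, continuation lines included
};

// Scans a comment starting at text[pos]; text[pos..pos+1] is assumed to be "//".
CommentScan scan_line_comment(const std::string& text, std::size_t pos) {
    std::size_t i = pos + 2;  // skip the leading "//"
    int lines = 0;
    while (i < text.size()) {
        const char c = text[i];
        if (c == '\\' && i + 1 < text.size()) {
            // Backslash followed by \n or \r\n: the comment continues on the next line.
            std::size_t j = i + 1;
            if (text[j] == '\r' && j + 1 < text.size() && text[j + 1] == '\n') ++j;
            if (text[j] == '\n') { ++lines; i = j + 1; continue; }
            i += 2;  // any other escaped character is plain comment text
            continue;
        }
        if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') {
            ++lines; i += 2; break;  // unescaped \r\n ends the comment
        }
        if (c == '\n') { ++lines; ++i; break; }  // unescaped \n ends the comment
        ++i;  // ordinary character, including a lone \r
    }
    return {i, lines};
}

// The input from the question: the backslash joins "int main" to the comment.
void check_scan_line_comment() {
    const CommentScan r = scan_line_comment("// hello\\\nint main\nx", 0);
    assert(r.end == 19 && r.lines_consumed == 2);
}
```

With the question's input, the comment runs through int main and two lines are counted, which is the behaviour the original rules failed to produce.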

Word 'if' interpreted as 'if()' function call. Parens not allowed

So I discovered that writing an if statement with parentheses in Perl 6 results in it throwing this error at me:
===SORRY!===
Word 'if' interpreted as 'if()' function call; please use whitespace instead of parens
at C:/test.p6:8
------> if<HERE>(True) {
Unexpected block in infix position (two terms in a row)
at C:/test.p6:8
------> if(True)<HERE> {
This makes me assume that there is some sort of if() function? However, creating and running a script with if(); in it produces the following compiler error:
===SORRY!===
Undeclared routine:
if used at line 15
So like what's the deal?
I read here https://en.wikibooks.org/wiki/Perl_6_Programming/Control_Structures#if.2Funless that parens are optional, but that seems not to be the case for me.
My if statements do work without parens; I'm just wondering why it would stop me from using them, or why it would think that if is a subroutine because of them.
EDIT: Well aren't I a bit daft... looks like I wasn't reading well enough at the link I linked which I assume is why you are confused. The link I linked points out the following which was basically what I was asking:
if($x > 5) { # Calls subroutine "if"
}
if ($x > 5) { # An if conditional
}
I've accepted the below answer as it does provide some insight.
Are you sure you created a sub with the name 'if'? If so (no pun intended), you get the keyword when a space follows the literal 'if', and you get your pre-declared function when a paren follows it. That is, if your use of the term looks like a function call and you have declared such a function, it will be called:
user@localhost:~$ perl6
> sub if(Str $s) { say "if sub says: arg = $s" };
sub if (Str $s) { #`(Sub|95001528) ... }
> if "Hello World";
===SORRY!=== Error while compiling <unknown file>
Missing block
at <unknown file>:1
------> if "Hello World"⏏;
expecting any of:
block or pointy block
> if("Hello World");
if sub says: arg = Hello World
>
> if 12 < 16 { say "Excellent!" }
Excellent!
>
You can see above, I've declared a function called 'if'.
if "Hello World"; errors as the space means I'm using the keyword and therefore we have a syntax error in trying to use the if keyword.
if("Hello World") successfully calls the pre-declared function.
if 12 < 16 { say "Excellent!" } works correctly as the space means 'if' is interpreted as the keyword and this time there is no syntax error.
So again, are you sure you have (or better - can you paste here) your pre-declared 'if' function?
The reference for keywords and whitespace (which coincidentally uses the keyword 'if' as an example!) is here: S02 - Keywords and whitespace

Why is the flex regex being skipped?

I can't, for the life of me, figure out what's wrong with my regexes.
What I'd like to tokenize are two (2) types of strings, both of which must be contained on a single line. One string can be anything (other than a new line), and the other can contain only alphanumeric (ASCII) characters and the literal characters '_', '/', '-', and '.'.
The snippet of flex code is:
nl \n|\r\n|\r|\f|\n\r
...
%%
...
\"[^\"]+{nl} { frx_parser_error("Label is missing trailing double quote."); }
\"[a-zA-Z0-9_\.\/\-]+\" {
if (yyleng > 1024) frx_parser_error("File name too long.");
yytext[yyleng - 1] = '\0';
frx_parser_lval.str = strdup(yytext+1);
fprintf(stderr,"TOSP_FILENAME: %s\n", frx_parser_lval.str);
return (TOSP_FILENAME);
}
\"[^{nl}]+\" {
yytext[yyleng - 1] = '\0';
frx_parser_lval.str = strdup(yytext+1);
fprintf(stderr,"TOSP_IDENTIFIER:\n%s\n", frx_parser_lval.str);
return (TOSP_IDENTIFIER);
}
And when I run the parser, the fprintf's spit this out:
TOSP_FILENAME: ModStar-Picture-Analysis.txt
TOSP_FILENAME: ModStar-Rubric.log.txt
TOSP_IDENTIFIER:
picture-A"
Progress (26,255) camera 'C' root("picture-C-
Syntax (line 34): syntax error
For whatever reason, the quote after picture-A is being ... missed. Why? I checked the ASCII values at the eight locations where the double quote character appears, and they're all 0x22.
If I add some characters to the end of the "picture-A" it can work sometimes; adding ".par", ".pbr" doesn't work as expected, but ".pnr" does.
I've even added a specific non-regexy token:
\"picture-A\" { frx_parser_lval.str = strdup("picture-A"); return TOSP_FILENAME; }
to the lex file and it gets skipped.
I'm using flex 2.5.39, no flex libraries, one option (%option prefix=frx_parser_) in the lex file and the flex command line is:
flex -t script-lexer.l > script-lexer.c
What gives?
EDIT I need to test this on the actual system, but unit tests show this tokenizer to be much more robust (based on rici's answer):
nl \n|\r\n|\r|\f|\n\r
...
%%
...
["][^"]+{nl} { printf("Missing trailing quote.\n%s\n",yytext); }
["][[:alnum:]_./-]+["] { printf("File name:\n%s\n",yytext); }
["][^"]+["] { printf("String:\n%s\n",yytext); }
EDIT The rule ["].+["] swallows multiple consecutive strings as one big string. It was changed to ["][^"]+["]
The problem is your pattern:
\"[^{nl}]+\"
You're attempting to expand a definition inside a character class, but that is not possible; inside a character class, { is always just a {, not a flex operator. See the flex manual:
Note that inside of a character class, all regular expression operators lose their special meaning except escape (‘\’) and the character class operators, ‘-’, ‘]]’, and, at the beginning of the class, ‘^’.
A definition is not a macro. Rather, a definition defines a new regular expression operator.
As a consequence of the above, you can write [^\"] as simply [^"] and \"[a-zA-Z0-9_\.\/\-]+\" as \"[a-zA-Z0-9_./-]+\" (The - needs to be either at the end or at the beginning.) Personally, I'd write the second pattern as:
["][[:alnum:]_./-]+["]
But everyone has their own style.
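The effect can be reproduced outside flex. The sketch below uses C++ std::regex, which is an assumption of convenience (it is a backtracking engine, not flex's longest-match scanner), but it also treats { and } inside a character class as plain characters. The input string is a shortened stand-in for the question's data, and first_match and check_brace_in_class are names made up here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the first match of `pattern` in `text`, or "" if there is none.
std::string first_match(const std::string& pattern, const std::string& text) {
    std::smatch m;
    return std::regex_search(text, m, std::regex(pattern)) ? m.str() : std::string();
}

void check_brace_in_class() {
    // Two quoted strings on separate lines (a shortened, invented input).
    const std::string input = "\"picture-A\"\nroot(\"picture-C-n\")";

    // [^{nl}] excludes only '{', 'n', 'l' and '}', so the match runs past the
    // closing quote and the line break until an 'n' finally stops it.
    assert(first_match("\"[^{nl}]+\"", input) == "\"picture-A\"\nroot(\"");

    // Excluding the quote itself stops the match at the closing quote.
    assert(first_match("\"[^\"]+\"", input) == "\"picture-A\"");
}
```

This also suggests why a suffix such as ".pnr" appeared to help in the question while ".par" did not: an n or an l is one of the few characters the broken class refuses to consume.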

vim indentation braces inside parentheses

In vim (e.g. 7.3), how can I use/modify the cindent or smartindent options, or otherwise augment my .vimrc, in order to automatically indent curly braces inside open parentheses to align to the first "word" (defined later) directly preceding the opening (?
The fN option seems promising, but appears to be overridden by the (N option when inside open parentheses. From :help cinoptions-values:
fN Place the first opening brace of a function or other block in
column N. This applies only for an opening brace that is not
inside other braces and is at the start of the line. What comes
after the brace is put relative to this brace. (default 0).
cino=               cino=f.5s           cino=f1s
  func()              func()              func()
  {                     {                     {
      int foo;              int foo;              int foo;
Current behavior:
func (// no closing )
      // (N behavior, here N=0
      { // (N behavior overrides fN ?
        int foo; // >N behavior, here N=2
while I wish for:
func (// no closing )
      // (N behavior as before
{ // desired behavior
  int foo; // >N behavior still works
What I am asking for is different from fN because fN aligns to the prevailing indent, and I want to align to any C++ nested-name-specifier that directly precedes the opening (, like
code; f::g<T> (        instead of        code; f::g<T> (
      {                                  {
If there is no nested-name-specifier, I'd like it to match the ( itself. Perhaps matching a nested-name-specifier is too complicated, or maybe there is another part of the grammar that is more appropriate for this scenario. Anyway, for my typical use case, I think I'd be satisfied if the { aligns with the first non-whitespace character of the maximal sequence of characters to the left of the innermost unclosed (, inclusive, that does not contain any semicolons or left curly braces {.
By the way, I arrived at this when trying to autoindent various std::for_each(b,e,[]{}); constructs in vim7.3. Thanks for your help!
Not sure that any of the {auto,smart,c}indent features could be finagled to do what you want. I made up a mapping which might give some inspiration:
inoremap ({ <esc>T<space>y0A<space>(<cr><esc>pVr<space>A{
Downsides are that you may need to do something smarter than 'T' to get back to the beginning of the last identifier (you could use '?' with a regex), that it trashes your default register, and that if your identifier before the paren is at the start of the line you have to do '({' yourself. The notion is to jump back to just before the identifier, copy to the beginning of the line, paste that to the next line, and replace every character with a space.
Good luck!

What is the semicolon in C++?

Roughly speaking in C++ there are:
operators (+, -, *, [], new, ...)
identifiers (names of classes, variables, functions,...)
const literals (10, 2.5, "100", ...)
some keywords (int, class, typename, mutable, ...)
brackets ({, }, <, >)
preprocessor (#, ## ...).
But what is the semicolon?
The semicolon is a punctuator, see 2.13 §1
The lexical representation of C++ programs includes a number of preprocessing tokens which are used in
the syntax of the preprocessor or are converted into tokens for operators and punctuators
It is part of the syntax and is an element of several statements. In EBNF:
<do-statement>
::= 'do' <statement> 'while' '(' <expression> ')' ';'
<goto-statement>
::= 'goto' <label> ';'
<for-statement>
::= 'for' '(' <for-initialization> ';' <for-control> ';' <for-iteration> ')' <statement>
<expression-statement>
::= [ <expression> ] ';'
<return-statement>
::= 'return' [ <expression> ] ';'
This list is not complete.
The semicolon is a terminal, a token that terminates something. What exactly it terminates depends on the context.
Semicolon denotes sequential composition. It is also used to delineate declarations.
Semicolon is a statement terminator.
The semicolon isn't given a specific name in the C++ standard. It's simply a character that's used in certain grammar productions (and it just happens to be at the end of them quite often, so it 'terminates' those grammatical constructs). For example, a semicolon character is at the end of the following parts of the C++ grammar (not necessarily a complete list):
an expression-statement
a do/while iteration-statement
the various jump-statements
the simple-declaration
Note that in an expression-statement, the expression is optional. That's why a 'run' of semicolons, ;;;;, is valid in many (but not all) places where a single one is.
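A small sketch of that point, with made-up function names: each bare ; is an expression-statement with its expression left out, and the same optionality is what lets all three parts of a for header be empty. One place where an extra semicolon is not harmless is between an if branch and its else, since the extra empty statement separates the else from its if.

```cpp
#include <cassert>

// Counts up to `limit` using only empty statements and an empty for header.
int count_to(int limit) {
    int n = 0;
    ;;;;                        // four empty statements in a row
    for (;;) {                  // init, condition and increment all omitted
        if (n == limit) break;
        ++n;
    }
    return n;
}

void check_count_to() {
    assert(count_to(3) == 3);
    assert(count_to(0) == 0);
}
```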
Semicolons are often used to delimit one bit of C++ source code, indicating it's intentionally separate from the following code. To see how that's useful, let's imagine we didn't use it. For example:
#include <iostream>
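int f() { std::cout << "f()\n"; return 0; } // return added: flowing off the end of a non-void function is undefined behaviour
int g() { std::cout << "g()\n"; return 0; }
int main(int argc, char**) // standard signature; argv is unused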
{
std::cout << "message"
"\0\1\0\1\1"[argc] ? f() : g(); // final ';' needed to make this compile
// but imagine it's not there in this new
// semicolon-less C++ variant....
}
This (horrible) bit of code, called with no arguments such that argc is 1, prints:
ef()\n
Why not "messagef()\n"? That's what one might expect: first std::cout << "message" prints message, and then "\0\1\0\1\1"[1] is '\1', which is true in a boolean sense, so f() should be called and print f()\n.
Because... (drumroll please)... in C++ adjacent string literals are concatenated, so the program's parsed like this:
std::cout << "message\0\1\0\1\1"[argc] ? f() : g();
What this does is:
find the [argc/1] (second) character in "message\0\1\0\1\1", which is the first 'e'
send that 'e' to std::cout (printing it)
the ternary operator '?' triggers a conversion of std::cout to bool, which produces true (because the printing presumably worked), so f() is called...!
Given this string literal concatenation is incredibly useful for specifying long strings (and even shorter multi-line strings in a readable format), we certainly wouldn't want to assume that such strings shouldn't be concatenated. Consequently, if the semicolon's gone then the compiler must assume the concatenation is intended, even though visually the layout of the code above implies otherwise.
That's a convoluted example of how C++ code with and without ';'s changes meaning. I'm sure if I or other readers think on it for a few minutes we could come up with other - and simpler - examples.
Anyway, the ';' is necessary to inform the compiler that statement termination/separation is intended.
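The two readings can be compared side by side. This is a sketch rather than the answer's own program: it writes to a std::ostringstream instead of std::cout so the result can be inspected, and it replaces f() and g() with direct writes of the same text. The function names are invented for the comparison.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// With the semicolon: two statements, so the two literals stay separate.
std::string with_semicolon(int argc) {
    std::ostringstream out;
    out << "message";
    "\0\1\0\1\1"[argc] ? (out << "f()\n") : (out << "g()\n");
    return out.str();
}

// Without it: the adjacent literals are concatenated before the indexing happens,
// so a single character of "message..." is printed and the stream becomes the condition.
std::string without_semicolon(int argc) {
    std::ostringstream out;
    out << "message"
           "\0\1\0\1\1"[argc] ? (out << "f()\n") : (out << "g()\n");
    return out.str();
}

void check_semicolon_effect() {
    assert(with_semicolon(1) == "messagef()\n");
    assert(without_semicolon(1) == "ef()\n");
}
```

A compiler may warn about the unused result of the conditional expression; that is expected here, since the expression is evaluated only for its side effects.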
The semicolon lets the compiler know that it's reached the end of a command AFAIK.
The semicolon (;) is a command in C++. It tells the compiler that you're at the end of a command.
If I recall correctly, Kernighan and Ritchie called it punctuation.
Technically, it's just a token (or terminal, in compiler-speak), which
can occur in specific places in the grammar, with a specific semantics
in the language. The distinction between operators and other punctuation
is somewhat artificial, but useful in the context of C or C++, since
some tokens (,, = and :) can be either operators or punctuation,
depending on context, e.g.:
f( a, b ); // comma is punctuation
f( (a, b) ); // comma is operator
a = b; // = is assignment operator
int a = b; // = is punctuation
x = c ? a : b; // colon is operator
label: // colon is punctuation
In the case of the first two, the distinction is important, since a user
defined overload will only affect the operator, not punctuation.
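That last point can be demonstrated with a deliberately odd overload. Everything in this sketch (the Tracker type, the arithmetic inside operator, and the two function names) is made up for the illustration; overloading the comma operator is rarely a good idea in real code.

```cpp
#include <cassert>

struct Tracker { int value; };

// Hypothetical overload: makes it visible whenever the comma OPERATOR is used.
Tracker operator,(const Tracker& lhs, const Tracker& rhs) {
    return Tracker{lhs.value * 10 + rhs.value};
}

int two_args(Tracker a, Tracker b) { return a.value + b.value; }
int one_arg(Tracker a) { return a.value; }

void check_comma_roles() {
    Tracker a{1}, b{2};              // this comma separates declarators: punctuation
    assert(two_args(a, b) == 3);     // argument separator: the overload is not involved
    assert(one_arg((a, b)) == 12);   // parenthesised: comma operator, overload is called
}
```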
It represents the end of a C++ statement.
For example,
int i=0;
i++;
In the above code there are two statements. The first is for declaring the variable and the second one is for incrementing the value of variable by one.