How do I parse this correctly with spirit? - c++

My situation: I'm new to Spirit, I have to use VC6 and am thus using Spirit 1.6.4.
I have a line that looks like this:
//The Description;DESCRIPTION;;
I want to put the text DESCRIPTION in a string if the line starts with //The Description;.
I have something that works, but it doesn't look very elegant to me:
vector<char> vDescription; // std::string doesn't work due to missing ::clear() in VC6's STL implementation
if(parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+~ch_p(';'))[assign(vDescription)]
),
// End grammar
space_p).hit)
{
const string desc(vDescription.begin(), vDescription.end());
}
I would much rather assign all printable characters up to the next ';', but the following doesn't work because parse(...).hit == false:
parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+print_p)[assign(vDescription)]
>> ';'
),
// End grammar
space_p).hit
How do I make it hit?

You might try using confix_p:
confix_p(as_lower_d["//the description;"],
(+print_p)[assign(vDescription)],
ch_p(';')
)
It should be equivalent to Fred's response.
Your code fails because print_p is greedy. The +print_p parser consumes characters until it reaches the end of the input or a non-printable character. A semicolon is printable, so print_p claims it. Your input gets exhausted, the variable is assigned, and the match fails because there's nothing left for the final ';' in your parser to match.
Fred's answer constructs a new parser, (print_p - ';'), which matches everything print_p does, except for semicolons. "Match everything except X, and then match X" is a common pattern, so confix_p is provided as a shortcut for constructing that kind of parser. The documentation suggests using it for parsing C- or Pascal-style comments, but that's not required.
For your code to work, Spirit would need to recognize that the greedy print_p matched too much and then backtrack to allow matching less. But although Spirit will backtrack, it won't backtrack to the "middle" of what a sub-parser would otherwise greedily match. It will backtrack to the next "choice point," but your grammar doesn't have any. See Exhaustive backtracking and greedy RD in the Spirit documentation.
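To make the failure mode concrete, here is a minimal sketch in plain C++ (no Spirit involved). The helper names match_plus, parse_greedy and parse_excluding are made up for this illustration; they only imitate how +print_p >> ';' and +(print_p - ';') >> ';' behave on the text that follows the prefix.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Greedy "one or more" matcher: consumes every leading character that
// satisfies pred and returns how many it took (0 means no match).
// Like Spirit's operator+, it never gives characters back afterwards.
template <typename Pred>
std::size_t match_plus(const std::string& in, std::size_t pos, Pred pred) {
    std::size_t n = 0;
    while (pos + n < in.size() && pred(static_cast<unsigned char>(in[pos + n]))) ++n;
    return n;
}

// Imitates  +print_p >> ';'  : the greedy run also swallows semicolons.
bool parse_greedy(const std::string& in, std::string& out) {
    std::size_t n = match_plus(in, 0, [](unsigned char c) { return std::isprint(c) != 0; });
    // After the run there must still be a ';' to match; there is none if
    // the run already consumed it.
    if (n == 0 || n >= in.size() || in[n] != ';') return false;
    out = in.substr(0, n);
    return true;
}

// Imitates  +(print_p - ';') >> ';'  : the run stops in front of the first ';'.
bool parse_excluding(const std::string& in, std::string& out) {
    std::size_t n = match_plus(in, 0, [](unsigned char c) { return std::isprint(c) != 0 && c != ';'; });
    if (n == 0 || n >= in.size() || in[n] != ';') return false;
    out = in.substr(0, n);
    return true;
}
```

With the input DESCRIPTION;; the first function should report no match, because the greedy run reaches the end of the input, while the second should stop at the first semicolon and yield DESCRIPTION.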

You're not getting a hit because ';' is matched by print_p. Try this:
parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+(print_p-';'))[assign(vDescription)]
>> ';'
),
// End grammar
space_p).hit


Error while compiling regex function, why am I getting this issue?

My RAKU Code:
sub comments {
    if ($DEBUG) { say "<filtering comments>\n"; }
    my @filteredtitles = ();
    # This loops through each track
    for @tracks -> $title {
        ##########################
        # LAB 1 TASK 2 #
        ##########################
        ## Add regex substitutions to remove superfluous comments and all that follows them
        ## Assign to $_ with smartmatcher (~~)
        ##########################
        $_ = $title;
        if ($_) ~~ s:g:mrx/ .*<?[\(^.*]> / {
            # Repeat for the other symbols
            ########################## End Task 2
            # Add the edited $title to the new array of titles
            @filteredtitles.push: $_;
        }
    }
    # Updates @tracks
    return @filteredtitles;
}
Result when compiling:
Error Compiling! Placeholder variable '@_' may not be used here because the surrounding block doesn't take a signature.
Is there something obvious that I am missing? Any help is appreciated.
So, in contrast with @raiph's answer, here's what I have:
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
Just that. Nothing else. Let's dissect it, from the inside out:
This part: / <[\(^]> / is a regular expression that matches a single character, as long as it is an open parenthesis (written \() or a caret (^). Placing them inside the angle-bracket/square-bracket combination makes this an enumerated character class.
Next, S introduces the non-destructive substitution, i.e., a quoting construct that performs regex-based substitutions on the topic variable $_ without modifying it; it just returns the value with the requested modifications applied. In the code above, S:g brings in the adverb :g or :global (see the global adverb in the adverbs section of the documentation), which for a substitution means "make this substitution as many times as possible". The final / marks the end of the replacement text, and since it is adjacent to the second /, the replacement is empty. So
S:g / <[\(^]> //
means "please return the contents of $_, but modified in such a way that all its characters matching the regex <[\(^]> are deleted (substituted for the empty string)"
At this point, I should emphasize that regular expressions in Raku are really powerful, and that reading the entire page (and probably the best practices and gotchas page too) is a good idea.
Next, the .map method, documented here, can be applied to any Iterable (List, Array and the like) and returns a sequence built from each element of the Iterable, transformed by the Code passed to it. So, something like:
@x.map({ S:g / foo /bar/ })
essentially means "please return a Sequence of every item in @x, modified by substituting bar for every occurrence of the substring foo" (nothing in @x is altered). A nice place to start to learn about sequences and iterables would be here.
Finally, my one-liner
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
can be translated as:
I have a List with three string elements
Foo
Ba(r
B^az
(This would be a placeholder for your "list of titles".) Take that list and generate a second one that contains every element of it, but with all instances of the characters "open parenthesis" and "caret" removed.
Ah, and store the result in the variable @tracks (which is lexically scoped with my).
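For readers who think in C++ rather than Raku, the same idea can be approximated with the standard library. This is only a rough analogue, not a translation: strip_chars is a hypothetical helper, std::regex_replace plays the role of S:g, and std::transform plays the role of .map.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Returns a new list in which every '(' and '^' has been removed from each
// title. The input list is left untouched, much like Raku's non-destructive
// S/// leaves the topic variable alone.
std::vector<std::string> strip_chars(const std::vector<std::string>& titles) {
    static const std::regex unwanted("[(^]");  // character class: '(' or '^'
    std::vector<std::string> result;
    result.reserve(titles.size());
    // Apply the substitution to each element and collect the results.
    std::transform(titles.begin(), titles.end(), std::back_inserter(result),
                   [](const std::string& t) { return std::regex_replace(t, unwanted, ""); });
    return result;
}
```

Calling strip_chars on the three sample titles Foo, Ba(r and B^az should give Foo, Bar and Baz, with the original vector unchanged.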
Here's what I ended up with:
my @tracks = <Foo Ba(r B^az>;
sub comments {
    my @filteredtitles;
    for @tracks -> $_ is copy {
        s:g / <[\(^]> //;
        @filteredtitles.push: $_;
    }
    return @filteredtitles;
}
The is copy ensures the variable set up by the for loop is mutable.
The s:g/...//; is all that's needed to strip the unwanted characters.
One thing no one can help you with is the error you reported. I currently think you just got confused.
Here's an example of code that generates that error:
do { @_ }
But there is no way the code you've shared could generate that error because it requires that there is an @_ variable in your code, and there isn't one.
One way I can help in relation to future problems you may report on StackOverflow is to encourage you to read and apply the guidance in Minimal Reproducible Example.
While your code did not generate the error you reported, it will perhaps help you if you know about some of the other compile time and run time errors there were in the code you shared.
Compile-time errors:
You wrote s:g:mrx. That's invalid: Adverb mrx not allowed on substitution.
You missed out the third slash of the s///. That causes mayhem (see below).
There were several run-time errors, once I got past the compile-time errors. I'll discuss just one, the regex:
.*<?[...]> will match any sub-string with a final character that's one of the ones listed in the [...], and will then capture that sub-string except without the final character. In the context of an s:g/...// substitution this will strip ordinary characters (captured by the .*) but leave the special characters.
This makes no sense.
So I dropped the .*, and also the ? from the special character pattern, changing it from <?[...]> (which just tries to match against the character, but does not capture it if it succeeds) to just <[...]> (which also tries to match against the character, but, if it succeeds, does capture it as well).
A final comment is about an error you made that may well have seriously confused you.
In a nutshell, the s/// construct must have three slashes.
In your question you had code of the form s/.../ (or s:g/.../ etc), without the final slash. If you try to compile such code the parser gets utterly confused because it will think you're just writing a long replacement string.
For example, if you wrote this code:
if s/foo/ { say 'foo' }
if m/bar/ { say 'bar' }
it'd be as if you'd written:
if s/foo/ { say 'foo' }\nif m/...
which in turn would mean you'd get the compile-time error:
Missing block
------> if m/⏏bar/ { ... }
expecting any of:
block or pointy block
...
because Raku(do) would have interpreted the part between the second and third /s as the replacement double quoted string of what it interpreted as an s/.../.../ construct, leading it to barf when it encountered bar.
So, to recap, the s/// construct requires three slashes, not two.
(I'm ignoring syntactic variants of the construct such as, say, s [...] = '...'.)

Unexpected behaviour when parsing a string with optional Suffix in antlr4

I want to match multiple functions that accept a comma-separated list of placeholders followed by the definition of a unit, which is again separated by a comma from the rest of the arguments. The text to parse would look like example 1: "produkt([F1],[F2],EURO_CENT)" or example 2: "produkt([F1],[F2],EURO)".
The grammar, as I would expect it to work, is this:
[...]
term: [...]
| 'produkt(' placeholder ',' placeholder ',' UNIT ')' #MultUnit
[...]
| placeholder #PlaceholderTwo
;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
LBRACK: '[';
RBRACK: ']';
PLACE: TEXT+ NUMBER?;
placeholder: LBRACK PLACE+ RBRACK;
[..]
UNIT: TEXT (('_' TEXT)*)?;
TEXT: ('a' .. 'z' | 'A' .. 'Z')+;//[a-zA-Z]+;
[...]
With this grammar, example 1 works as expected, but example 2 gives me the error "line 1:18 mismatched input 'EURO' expecting UNIT". As I understand it, this means that "EURO" itself does not match the pattern for UNIT but "EURO_CENT" does. I do not understand why this is the case, because the pattern for UNIT says that the "_CENT" part is optional and only the first part is mandatory.
I also tried to give the UNIT a prefix (in this case "Unit.") by changing the pattern for UNIT to UNIT: 'Unit.' TEXT ('_' TEXT)*;
I changed the input string to "produkt([F1],[F2],Unit.EURO)" accordingly, and this matches like a charm.
However, the second approach is not very user-friendly, since we have to add something that is (in our opinion) unnecessary to the input. So the question is: why does the first option not match as expected when the UNIT string is a single word, and is there a workaround for it?
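The short answer is that PLACE and UNIT are mutually ambiguous for content that only matches TEXT. The lexer prefers the longest match and, when two rules match the same text, the rule defined first, so EURO is tokenized as PLACE rather than UNIT. If the sample inputs are canonical, then change the PLACE rule to remove the ambiguity: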
PLACE : TEXT+ NUMBER ;
Other possibilities include redefining PLACE as
PLACE : LBRACK TEXT+ NUMBER? RBRACK; // adjust other rules accordingly
adding a predicate to the rule:
PLACE : {followsLBRACK()}? TEXT+ NUMBER ;
and redefining UNIT:
UNIT: TEXT ( 'S' | ( '_' TEXT )+ ) ; // EUROS or EURO_CENT; similar for other units.
BTW, Antlr generally evaluates its grammars top-down, so mixing your rules as you have actually obfuscates the logic.
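The ambiguity can be reproduced outside ANTLR with a small sketch of the usual lexer tie-breaking rule (longest match wins; on equal length, the rule defined first wins). The function winning_rule and the regex translations of PLACE, UNIT and TEXT are assumptions made for illustration, and NUMBER, which the question does not show, is assumed to be one or more digits.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns the name of the rule that would win at the start of `input`:
// longest match first, and on a tie the rule listed earliest.
std::string winning_rule(const std::string& input) {
    // Rule order mirrors the question's grammar; NUMBER is assumed to be [0-9]+.
    static const std::vector<std::pair<std::string, std::regex>> rules = {
        {"PLACE", std::regex("[a-zA-Z]+[0-9]*")},          // TEXT+ NUMBER?
        {"UNIT",  std::regex("[a-zA-Z]+(_[a-zA-Z]+)*")},   // TEXT ('_' TEXT)*
        {"TEXT",  std::regex("[a-zA-Z]+")},
    };
    std::string best = "<none>";
    std::size_t best_len = 0;
    for (const auto& rule : rules) {
        std::smatch m;
        // match_continuous anchors the match at the first character.
        if (std::regex_search(input, m, rule.second,
                              std::regex_constants::match_continuous)) {
            std::size_t len = static_cast<std::size_t>(m.length(0));
            if (len > best_len) {  // strict '>' keeps the earlier rule on ties
                best = rule.first;
                best_len = len;
            }
        }
    }
    return best;
}
```

Under those assumptions EURO is claimed by PLACE, while EURO_CENT goes to UNIT because only UNIT can match the longer text.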

In Boost Spirit Qi, how do I match every character up to the next whitespace (with pre-skip)

Within a boost::spirit::qi grammar rule, how do you match a string of characters up to and excluding the next whitespace character, as defined by the supplied skipper?
For example, if the grammar is a set of attributes defined as:
attributeList = '(' >> *attribute >> ')';
attribute = (name : value) | (name : value units);
How do I match any character for name up to and excluding the first skipper character?
For example, for name, I would like to pre-skip, then match all characters except ':' or a skipper character. Do I have to instantiate a skipper within the grammar class, so that I can do something like:
name = +qi::char_ !(skipper | ':');
or can I access the existing supplied skipper object somehow and reference it directly? Also, I don't believe this needs to be wrapped in qi::lexeme[]...
Thanks in advance for correcting the error of my ways
In order to do this, you'll need to suppress skipping, so qi::lexeme will have to be involved (or at least qi::no_skip, but you'd only use it to reimplement qi::lexeme), and to do precisely what you write you'll also need the skip parser. Then you could write
qi::lexeme[ +(qi::char_ - ':' - skipper) ]
The requirements seem rather lax, though. It is unusual to allow even non-printable characters such as the bell sign (ASCII 7) in identifiers. I don't know what exactly you're trying to do, so I can't answer such design questions for you, but to me it seems like there's a good chance you'd be happier with a more standard rule such as
qi::lexeme[ qi::alpha >> *qi::alnum ]
(for a very simple example. Your mileage may vary wrt underscores etc.)

Perl String Regular Expression - Need Explanation

I am pretty new to Perl. I have the following code fragment that works just fine, but I don't fully understand it:
for ($i = 1; $i <= $pop->Count(); $i++) {
foreach ( $pop->Head( $i ) ) {
/^(From|Subject):\s+/i and print $_, "\n";
}
}
$pop->Head is a string or an array of strings returned by the function Mail::POP3Client, and it is the headers of a bunch of emails. Line 3 is some kind of regular expression that extracts the FROM and the SUBJECT from the header.
My question is how does the print function only print the From and the Subject without all the other stuff in the header? What does "and" mean - this surely can't be a boolean and can it? Most important, I want to put the From string into its own variable (my $fromline). How do I do this?
I am hoping that this will be easy for some Perl professional, it has got me baffled!
Thanks in advance.
ARGHHH... The question was edited while I was typing the answer. OK, throwing out the part of my answer that's no longer relevant, and focusing on the specific questions:
The outer loop iterates over all the messages in the mailbox.
The inner loop doesn't specify a loop variable, so the special variable $_ is used.
In each iteration through the inner loop, $_ is one header line from message number $i.
/^(From|Subject):\s+/i and print $_, "\n";
The first part of this line, up to the and is a pattern. We didn't specify what to do with the pattern, so it's implicitly matched against $_. (That's one of the things that makes $_ special.) This gives us a yes/no test: does the pattern match the header line or not?
The pattern tests whether that item begins with (^) either of the words "From" or "Subject", followed immediately by a colon and one or more whitespace characters. (This is not the correct pattern to match an RFC 822 header. Whitespace is optional on both sides of the colon. The pattern should more properly be /^(From|Subject)\s*:\s*/i. But that's a separate issue.) The i at the end of the pattern says to ignore case, so from or SUBJECT would be OK.
The and says to continue evaluating (i.e., executing) the expression if there is a match. If there's no match, whatever follows and is ignored.
The rest of the expression prints the header line ($_) and a newline ("\n").
In Perl, and and or are boolean operators. They're synonyms for && and ||, except that they have much lower precedence, making it easier to write short-circuit expressions without clutter from lots of parentheses.
The smallest change that captures the From line into a separate variable would be to add the following line to the inner loop:
/^From\s*:\s*(.*)$/i and $fromline = $1;
You should probably also put
$fromline = undef
before the loop so you can test, after the loop, whether there was a From: line.
There are other ways to do it. In fact, that's one of the mantras of perl: "There's more than one way to do it." I've stripped out the "From: " from the beginning of the line before storing the balance in $fromline, but I don't know your needs.
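The match-then-capture pattern is not specific to Perl. As a hedged comparison for readers coming from C++, the sketch below does the same job with std::regex; find_from is a hypothetical helper, and it returns an empty string where the Perl version would leave $fromline undefined.

```cpp
#include <regex>
#include <string>
#include <vector>

// Scans header lines and returns the text after "From:" taken from the last
// matching line, or an empty string when no From header is present.
std::string find_from(const std::vector<std::string>& headers) {
    // Same pattern as the Perl one-liner; icase corresponds to the /i flag.
    static const std::regex from_re("^From\\s*:\\s*(.*)$",
                                    std::regex_constants::icase);
    std::string fromline;
    for (const std::string& line : headers) {
        std::smatch m;
        // Perl's `PATTERN and ACTION` reads as: run ACTION only if PATTERN
        // matched. In C++ that is simply an if statement.
        if (std::regex_match(line, m, from_re)) {
            fromline = m[1].str();  // m[1] is the first capture group, like $1
        }
    }
    return fromline;
}
```

Feeding it a list such as Subject: hi, from: Alice, To: Bob should return Alice, since the match ignores case and the capture group starts after the colon and any whitespace.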
It's a logical and with short-circuiting. If the left side evaluates to true -- say, if that regular expression matches -- it'll evaluate the right side, the print.
If the expression on the left is false, it doesn't need to evaluate the right hand side, because the net result would still be false, so it skips it.
See also: perldoc perlop

Writing regular expressions and rules in Sublime Text 2 syntax definitions

I'm very interested in syntax definitions for Sublime Text 2.
I've studied the basics, but I don't know how to write a regex (and rules) for something like
variable = sentence, i.e. myvar = func(foo, bar) + baz
I can't write anything better than ^\s*([^=\n]+)=([^=\n]+\n) (that doesn't work)
How to write this RE in proper way?
Also, I have some difficulties defining a regex for a block like
IF i FROM .. TO ..
...
ELSE
...
END IF
How do I write it?
In this case you have to write a parser. A regex won't work because the patterns may vary. You've already noticed it when you stated 'variable = sentence'.
For this, you can use Spoofax or JavaCup for grammar definitions. I'll give you a snippet in JavaCup:
Scanner issues: suppose 'variable' follows the pattern: (_|[a-zA-Z])(_|[a-zA-Z])*
and 'number' is: ([0-9])+
Note that number could be any decimal or int, but here I state it as that pattern, supposing my language only deals with integer (or whatever that pattern means :) ).
Now we can declare our grammar following the JavaCUP syntax. Which is more or less like:
expression ::= variable "=" sentence;
sentence ::= sentence "+" sentence;
sentence ::= sentence "-" sentence;
sentence ::= sentence "*" sentence;
sentence ::= sentence "/" sentence;
sentence ::= number;
...and that goes further.
If you've never taken a compilers class, this may seem very difficult to follow. Plus, there are a lot of grammar restrictions to avoid infinite loops in the parser, depending on which kind you're using (LR or LL).
Anyway, the real answer to your question is: you can't do what you want with regex alone; you'll need more concepts.