Perl 6 Grammar doesn't match like I think it should - regex

I'm doing Advent of Code day 9:
You sit for a while and record part of the stream (your puzzle input). The characters represent groups - sequences that begin with { and end with }. Within a group, there are zero or more other things, separated by commas: either another group or garbage. Since groups can contain other groups, a } only closes the most-recently-opened unclosed group - that is, they are nestable. Your puzzle input represents a single, large group which itself contains many smaller ones.
Sometimes, instead of a group, you will find garbage. Garbage begins with < and ends with >. Between those angle brackets, almost any character can appear, including { and }. Within garbage, < has no special meaning.
In a futile attempt to clean up the garbage, some program has canceled some of the characters within it using !: inside garbage, any character that comes after ! should be ignored, including <, >, and even another !.
Of course, this screams out for a Perl 6 Grammar...
grammar Stream
{
    rule TOP { ^ <group> $ }
    rule group { '{' [ <group> || <garbage> ]* % ',' '}' }
    rule garbage { '<' [ <garbchar> | <garbignore> ]* '>' }
    token garbignore { '!' . }
    token garbchar { <-[ !> ]> }
}
This seems to work fine on simple examples, but it goes wrong with two garbchars in a row:
say Stream.parse('{<aa>}');
gives Nil.
Grammar::Tracer is no help:
TOP
| group
| | group
| | * FAIL
| | garbage
| | | garbchar
| | | * MATCH "a"
| | * FAIL
| * FAIL
* FAIL
Nil
Multiple garbignores are no problem:
say Stream.parse('{<!!a!a>}');
gives:
「{<!!a!a>}」
group => 「{<!!a!a>}」
garbage => 「<!!a!a>」
garbignore => 「!!」
garbchar => 「a」
garbignore => 「!a」
Any ideas?

UPD: Given that the Advent of Code problem doesn't mention whitespace, you shouldn't be using the rule construct at all. Just switch all the rules to tokens and you should be set. In general, follow Brad's advice: use token unless you know you need a rule (discussed below) or a regex (if you need backtracking).
My original answer below explored why the rules didn't work. I'll leave it in for now.
TL;DR <garbchar> | contains a space. Whitespace that directly follows any atom in a rule indicates a tokenizing break. You can simply remove this inappropriate space, i.e. write <garbchar>| instead (or better still, <.garbchar>| if you don't need to capture the garbage) to get the result you seek.
As your original question allowed, this isn't a bug, it's just that your mental model is off.
Your answer correctly identifies the issue: tokenization.
So what we're left with is your follow up question, which is about your mental model of tokenization, or at least how Perl 6 tokenizes by default:
why ... my second example ... goes wrong with two garbchars in a row:
'{<aa>}'
Simplifying, the issue is how to tokenize this:
aa
The simple high level answer is that, in parsing vernacular, aa will ordinarily be treated as one token, not two, and, by default, Perl 6 assumes this ordinary definition. This is the issue you're encountering.
You can overrule this ordinary definition to get any tokenizing result you care to achieve. But it's seldom necessary to do so and it certainly isn't in simple cases like this.
I'll provide two redundant paths that I hope might lead folk to the correct mental model:
For those who prefer diving straight into nitty gritty detail, there's a reddit comment I wrote recently about tokenization in Perl 6.
The rest of this SO answer provides a high level discussion that complements the low level explanation in my reddit comment.
Excerpting from the "Obstacles" section of the wikipedia page on tokenization, and interleaving the excerpts with P6 specific discussion:
Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example:
Punctuation and whitespace may or may not be included in the resulting list of tokens.
In Perl 6 you control what gets included or not in the parse tree using capturing features that are orthogonal to tokenizing.
All contiguous strings of alphabetic characters are part of one token; likewise with numbers.
Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.
By default, the Perl 6 design embodies an equivalent of these two heuristics.
The key thing to get is that it's the rule construct that handles a string of tokens, plural. The token construct is used to define a single token per call.
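One way to see why '{<aa>}' fails while '{<!!a!a>}' parses is to emulate the implicit whitespace matcher in an ordinary regex engine. The sketch below is Python, not Perl 6, and it rests on an assumption: that the default ws behaves like <!ww> \s* (refuse to match between two word characters, then consume optional whitespace). The names WS, RULE_LIKE, TOKEN_LIKE and parses are made up for this illustration.

```python
import re

# Assumed stand-in for Perl 6's default <ws>: fail between two word
# characters, otherwise consume optional whitespace.
WS = r"(?!(?<=\w)(?=\w))\s*"

# "garbage" written as a rule: each space after an atom becomes WS.
RULE_LIKE = re.compile(
    r"<" + WS + r"(?:[^!>]" + WS + r"|!." + WS + r")*" + WS + r">"
)

# "garbage" written as a token: no implicit whitespace matching at all.
TOKEN_LIKE = re.compile(r"<(?:[^!>]|!.)*>")


def parses(pattern, text):
    """Return True if the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None


for text in ("<a>", "<aa>", "<!!a!a>"):
    print(text, "rule-like:", parses(RULE_LIKE, text),
          "token-like:", parses(TOKEN_LIKE, text))
```

Under that assumption, only '<aa>' is rejected by the rule-like pattern: it is the only input where a garbchar is directly followed by another word character, so the emulated ws has nowhere to succeed. In '<!!a!a>' every 'a' is followed by '!' or '>', which is why that input happened to work.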
I think I'll end my answer here because it's already getting pretty long. Please use the comments to help us improve this answer. I hope what I've written so far helps.

A partial answer to my own question: Change all the rules to tokens and it works.
It makes sense, because the difference is :sigspace, which we don't need or want here. What I don't understand, though, is why it did work for some input, like my second example.
The resulting code is here, if you're interested.

Related

A Regex to ignore a set of words

Is there a way to set regex to ignore a set of words separated by space?
I have different products names like:
"Matrix 10X, 10 ml + DISPENSER"
"Matrix 10X,10ml + DISPENSER" where the quantity varies
What I'm trying to do is to replace using regex all words except for:
"10 ml" | "10 ML" | "10ml" ---> these are to be ignored
I have found a code to replace all characters except words separated by space (like "10 ml")
https://regex101.com/r/bG8vB4/5
and to replace them when they are together (like "10ml")
https://regex101.com/r/bG8vB4/4
but can't find a way to mix them together to keep just "10 ml" OR "10 ML" OR "10ml" and remove other characters up to the end of the string
Regexps are a mathematical model for efficient computer recognition of strings. Writing a regular expression that matches a string if it contains any of some words is easy, and theory shows that a regexp matching a string only if it contains none of those words also exists. Constructing that regexp, however, is far more complex.
In regular-language theory, every regular expression can be turned into a finite automaton, and the automaton that accepts exactly the strings the original rejects is obtained by swapping the accepting and non-accepting states of a complete deterministic automaton. The hard part is building a regular expression that corresponds to that automaton: it is always possible, but the result is generally far more complex than the original. (Some regexp libraries provide an operator for this, but you don't say which library you are using.) A simple example shows how complex this gets: recognizing a C language comment. A comment is a string delimited by the sequences /* and */, and the part in between cannot contain the sequence */.
The first approach could be to use the following regexp:
\/\*.*\*\/
but that fails, as the inner .* also matches */, so /* bla bla bla */ bla bla bla */ will be recognized as one whole comment (it should end at the first */). So we need a regexp that recognizes anything except something that includes */
Such subexpression is:
([^*]|\*[^/])*
which means an indefinite concatenation of characters other than *, or two-character sequences whose first character is * and whose second is not /. The intent is that this concatenation cannot run past the closing */, which leads to our final regexp:
\/\*([^*]|\*[^/])*\*\/
(now you see how the things complicate)
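The difference between the two comment patterns can be checked with a short script. This is a sketch in Python's re module (the question does not say which regex library is in use); the third pattern, star_runs, is a commonly used variant that is not part of the answer above and is included only to show one input worth testing.

```python
import re

text = "/* bla bla bla */ bla bla bla */"

# First approach: greedy .* runs through to the last */ in the string.
greedy = re.compile(r"/\*.*\*/")

# The answer's pattern: the body may not step over a "*" followed by "/".
guarded = re.compile(r"/\*([^*]|\*[^/])*\*/")

# Variant (an addition, not from the answer) that treats runs of "*"
# as a unit, so a comment ending in "**/" is still recognized.
star_runs = re.compile(r"/\*([^*]|\*+[^*/])*\*+/")

greedy_match = greedy.match(text).group()
guarded_match = guarded.match(text).group()
star_runs_match = star_runs.match(text).group()

edge = "/* a **/"
guarded_on_edge = guarded.match(edge)
star_runs_on_edge = star_runs.match(edge)

print(repr(greedy_match))
print(repr(guarded_match))
print(guarded_on_edge, star_runs_on_edge)
```

The greedy pattern swallows the whole line, while the other two stop at the first */. The edge input is a reminder that hand-built "anything except X" patterns deserve tests with repeated delimiter characters.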
To extend this to a single word (a word here being more than two letters) you have to allow:
([^w]|w[^o]|wo[^r]|wor[^d])*
in the set, and if you have two words (like foo and bar) you have to write:
([^fb]|f[^o]|fo[^o]|b[^a]|ba[^r])*
meaning that each word contributes its own set of alternatives, which makes the final regexp a bit complicated. There can also be interactions between words if one is a prefix of another or if they share leading characters. A further problem is that many libraries that compile regexps into finite automata treat the | operator as non-commutative and resolve alternatives in order, which can lead to erroneous results.
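The automaton route mentioned earlier (swap accepting and non-accepting states) can be sketched without building any regexp at all. The following Python is a minimal illustration under stated assumptions: a small fixed alphabet, a single forbidden word, and helper names (build_contains_dfa, run, contains, lacks) invented for the example.

```python
def build_contains_dfa(word, alphabet):
    """Build a complete DFA that accepts strings containing `word`.

    State k means: the longest suffix read so far that is also a prefix
    of `word` has length k. State len(word) is absorbing and accepting.
    """
    n = len(word)
    delta = {}
    for state in range(n + 1):
        for ch in alphabet:
            if state == n:
                delta[state, ch] = n  # once found, stay found
                continue
            candidate = word[:state] + ch
            k = min(n, len(candidate))
            # Fall back to the longest prefix of word that ends candidate.
            while k > 0 and not candidate.endswith(word[:k]):
                k -= 1
            delta[state, ch] = k
    return delta, {n}


def run(delta, accepting, text):
    """Run the DFA from state 0 and report whether it ends in `accepting`."""
    state = 0
    for ch in text:
        state = delta[state, ch]
    return state in accepting


ALPHABET = set("abcdefghijklmnopqrstuvwxyz ")
WORD = "word"
delta, accepting = build_contains_dfa(WORD, ALPHABET)

# Complement: same transitions, accepting set flipped.
rejecting = set(range(len(WORD) + 1)) - accepting


def contains(text):
    return run(delta, accepting, text)


def lacks(text):
    return run(delta, rejecting, text)


for sample in ("a wword here", "worm wood"):
    print(sample, contains(sample), lacks(sample))
```

Inputs such as "wword", where a partial match restarts in the middle, are also a useful test for hand-written alternations like the ones shown above.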
You also haven't explained what you mean by ignoring. Matching the words and passing over them is different from ignoring the whole line they appear on. The regexps (and the definition of the problem you need to solve) are then quite different. My explanation was about rejecting a full sentence if it contains any of the words, which is probably not what you mean. So please explain, in your question, what you mean by:
accepting you have matched a sentence containing a word.
rejecting such a sentence.
what are you rejecting (or ignoring) at all.
Rejecting just a word is simply selecting a sentence that contains that word and marking the word so you can pass over it. But that's a different problem, and it requires selecting sentences that do contain the word.

Regex to match hexadecimal and integer numbers [duplicate]

In a regular expression, I need to know how to match one thing or another, or both (in order). But at least one of the things needs to be there.
For example, the following regular expression
/^([0-9]+|\.[0-9]+)$/
will match
234
and
.56
but not
234.56
While the following regular expression
/^([0-9]+)?(\.[0-9]+)?$/
will match all three of the strings above, but it will also match the empty string, which we do not want.
I need something that will match all three of the strings above, but not the empty string. Is there an easy way to do that?
UPDATE:
Both Andrew's and Justin's below work for the simplified example I provided, but they don't (unless I'm mistaken) work for the actual use case that I was hoping to solve, so I should probably put that in now. Here's the actual regexp I'm using:
/^\s*-?0*(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})+)(?:\.[0-9]*)?(\s*|[A-Za-z_]*)*$/
This will match
45
45.988
45,689
34,569,098,233
567,900.90
-9
-34 banana fries
0.56 points
but it WON'T match
.56
and I need it to do this.
The fully general method, given regexes /^A$/ and /^B$/ is:
/^(A|B|AB)$/
i.e.
/^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/
Note the others have used the structure of your example to make a simplification. Specifically, they (implicitly) factorised it, to pull out the common [0-9]* and [0-9]+ factors on the left and right.
The working for this is:
all the elements of the alternation end in [0-9]+, so pull that out: /^(|\.|[0-9]+\.)[0-9]+$/
Now we have the possibility of the empty string in the alternation, so rewrite it using ? (i.e. use the equivalence (|a|b) = (a|b)?): /^(\.|[0-9]+\.)?[0-9]+$/
Again, an alternation with a common suffix (\. this time): /^((|[0-9]+)\.)?[0-9]+$/
the pattern (|a+) is the same as a*, so, finally: /^([0-9]*\.)?[0-9]+$/
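Each step of this factorisation can be checked mechanically. The sketch below uses Python's re module (the question itself uses JavaScript-style literals) and compares all five patterns on every string of length 0 to 6 over a two-character alphabet, with "7" standing in for any digit; the names STEPS, verdicts and disagreements are illustrative.

```python
import itertools
import re

# The starting pattern and each rewritten form from the working above.
STEPS = [
    r"^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$",
    r"^(|\.|[0-9]+\.)[0-9]+$",
    r"^(\.|[0-9]+\.)?[0-9]+$",
    r"^((|[0-9]+)\.)?[0-9]+$",
    r"^([0-9]*\.)?[0-9]+$",
]
COMPILED = [re.compile(p) for p in STEPS]


def verdicts(s):
    """Return, for each pattern, whether it matches the string s."""
    return [bool(c.match(s)) for c in COMPILED]


# Collect every short string on which the patterns disagree.
disagreements = []
for length in range(0, 7):
    for chars in itertools.product("7.", repeat=length):
        s = "".join(chars)
        if len(set(verdicts(s))) != 1:
            disagreements.append(s)

print("disagreements:", disagreements)
for s in ("234", ".56", "234.56", ""):
    print(repr(s), verdicts(s)[-1])
```

An exhaustive check over short strings is not a proof, but it catches slips in the algebra quickly.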
Nice answer by huon (and a bit of brain-twister to follow it along to the end). For anyone looking for a quick and simple answer to the title of this question, 'In a regular expression, match one thing or another, or both', it's worth mentioning that even (A|B|AB) can be simplified to:
A|A?B
Handy if B is a bit more complex.
Now, as c0d3rman's observed, this, in itself, will never match AB. It will only match A and B. (A|B|AB has the same issue.) What I left out was the all-important context of the original question, where the start and end of the string are also being matched. Here it is, written out fully:
^(A|A?B)$
Better still, just switch the order as c0d3rman recommended, and you can use it anywhere:
A?B|A
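The effect of the ordering is easy to observe in an engine that tries alternatives left to right. A small sketch in Python's re module, which behaves this way (engines with leftmost-longest semantics would differ); the variable names are illustrative:

```python
import re

haystack = "xABx"

# A is tried first and succeeds, so B is never consumed.
a_first = re.search(r"A|A?B", haystack).group()

# A?B is tried first, so the longer match wins when B is present.
b_first = re.search(r"A?B|A", haystack).group()

# With anchors, the engine is forced to backtrack into the second branch.
anchored = re.search(r"^(A|A?B)$", "AB").group()

print(a_first, b_first, anchored)
```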
Yes, you can match all of these with such an expression:
/^[0-9]*\.?[0-9]+$/
Note, it also doesn't match the empty string (your last condition).
Sure. You want the optional quantifier, ?.
/^(?=.)([0-9]+)?(\.[0-9]+)?$/
The above is slightly awkward-looking, but I wanted to show you your exact pattern with some ?s thrown in. In this version, (?=.) makes sure it doesn't accept an empty string, since I've made both clauses optional. A simpler version would be this:
/^\d*\.?\d+$/
This satisfies your requirements, including preventing an empty string.
Note that there are many ways to express this. Some are long and some are very terse, but they become more complex depending on what you're trying to allow/disallow.
Edit:
If you want to match this inside a larger string, I recommend splitting on spaces and testing the results with /^\d*\.?\d+$/. Otherwise, you'll risk either matching stuff like aaa.123.456.bbb or missing matches (trust me, you will. JavaScript's lack of lookbehind support ensures that it will be possible to break any pattern I can think of).
If you know for a fact that you won't get strings like the above, you can use word breaks instead of ^$ anchors, but it will get complicated because there's no word break between . and (a space).
/(\b\d+|\B\.)?\d*\b/g
That ought to do it. It will block stuff like aaa123.456bbb, but it will allow 123, 456, or 123.456. It will allow aaa.123.456.bbb, but as I've said, you'll need two steps if you want to comprehensively handle that.
Edit 2: Your use case
If you want to allow whitespace at the beginning, negative/positive marks, and words at the end, those are actually fairly strict rules. That's a good thing. You can just add them on to the simplest pattern above:
/^\s*[-+]?\d*\.?\d+[a-z_\s]*$/i
Allowing thousands groups complicates things greatly, and I suggest you take a look at the answer I linked to. Here's the resulting pattern:
/^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i
The \b ensures that the numeric part ends with a digit, and is followed by at least one whitespace.
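A quick way to gain confidence in the final pattern is to run it over the examples from the question. The sketch below ports it to Python's re module, with re.IGNORECASE standing in for the /i flag; the lists SHOULD_MATCH and SHOULD_NOT_MATCH are assembled for this check, and the second list is an assumption about what ought to be rejected.

```python
import re

PATTERN = re.compile(
    r"^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$",
    re.IGNORECASE,
)

# Examples taken from the question, plus the .56 case it asks for.
SHOULD_MATCH = [
    "45", "45.988", "45,689", "34,569,098,233", "567,900.90",
    "-9", "-34 banana fries", "0.56 points", ".56",
]

# Inputs assumed to be unwanted: no numeric part at all.
SHOULD_NOT_MATCH = ["", "   ", "banana"]

matched = [s for s in SHOULD_MATCH if PATTERN.match(s)]
wrongly_matched = [s for s in SHOULD_NOT_MATCH if PATTERN.match(s)]

print("matched:", matched)
print("wrongly matched:", wrongly_matched)
```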
Maybe this helps (to give you the general idea):
(?:((?(digits).^|[A-Za-z]+)|(?<digits>\d+))){1,2}
This pattern matches characters, digits, or digits following characters, but not characters following digits.
The pattern matches aa, aa11, and 11, but not 11aa, aa11aa, or the empty string.
Don't be puzzled by the ".^", which means "a character followed by line start"; it is intended to prevent any match at all.
Be warned that this does not work with all flavors of regex, your version of regex must support (?(named group)true|false).

Conditional regular expression with one section dependent on the result of another section of the regex

Is it possible to design a regular expression in a way that a part of it is dependent on another section of the same regular expression?
Consider the following example:
(ABCHEHG)[HGE]{5,1230}(EEJOPK)[DM]{5}
I want to continue this regex, and at some point I will have a section where the result of that section should depend on the result of [DM]{5}.
For example, D will be complemented by C, and M will be complemented by N.
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}[ACF]{1,1000}(BBBA)[CU]{2,5}[D'M']{5}
By D' I mean C, and by M' I mean N.
So a resulting string that matches the above regex, if it has DDDMM matching to the section [DM]{5}, it should necessarily have CCCNN matching to [D'M']{5}. Therefore, the result of [D'M']{5} always depends on [DM]{5}, or in other words, what matches to [DM]{5} always dictates what will match to [D'M']{5}.
Is it possible to do such a thing with regex?
Please note that, in this example I have extremely over-simplified the problem. The regex pattern I currently have is really much more complex and longer and my actual pattern includes about 5-6 of such dependent sections.
I cannot think of a way you can do this in pure regex. I would run 2 regex expressions. The first regex to extract the [DM]{5} string, such as
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}
And take the last 5 characters. Now replace the characters, for example in C# it would be result = result.Substring(result.Length - 5, 5).Replace('D', 'C').Replace('M', 'N'), and then concatenate like
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}[ACF]{1,1000}(BBBA)[CU]{2,5} + result
This is pretty easy to do in Perl:
m{
    ABCHEHG
    [HGHE]{5,1230}
    EEJOPK
    ( [DM]{5} )
    [ACF]{1,1000}
    BBBA
    [CU]{2,5}
    (??{ $1 =~ tr/DM/CN/r })
}x
I've added the x modifier and whitespace for better readability. I've also removed the capturing groups around the fixed strings (they're fixed strings; you already know what they're going to capture).
The crucial part is that we capture the string that was actually matched by [DM]{5} (in $1), which we then use at the end to dynamically generate a subpattern by replacing all D by C and M by N in $1.
This sounds like bioinformatics in python. Do 2-stage filtering, at regex level and at app level.
Wildcard the DM portions, so the regex is permissive in what it accepts. Bury the regex in a token generator that yields several matching sections. Have your app iterate through the generator's results, discarding any result rejected by your business logic, such as finding that one token is not the complement of another token.
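A minimal sketch of that two-stage idea in Python: the regex is permissive about the dependent section, and the application code then checks the complement relationship. The group names head and tail, the function is_valid, and the sample strings are all invented for the illustration; the pattern itself is the simplified one from the question.

```python
import re

# Stage 1: permissive regex. The final section accepts any C/N mix.
PATTERN = re.compile(
    r"ABCHEHG[HGE]{5,1230}EEJOPK"
    r"(?P<head>[DM]{5})"
    r"[ACF]{1,1000}BBBA[CU]{2,5}"
    r"(?P<tail>[CN]{5})"
)

# Stage 2: business rule. D is complemented by C, M by N.
COMPLEMENT = str.maketrans("DM", "CN")


def is_valid(seq):
    """Accept seq only if the tail is the complement of the head."""
    m = PATTERN.fullmatch(seq)
    if m is None:
        return False
    return m.group("tail") == m.group("head").translate(COMPLEMENT)


prefix = "ABCHEHG" + "HGHEH" + "EEJOPK" + "DDDMM" + "ACF" + "BBBA" + "CU"
good = prefix + "CCCNN"   # complement of DDDMM
bad = prefix + "CCNNN"    # right alphabet, wrong correspondence

print(is_valid(good), is_valid(bad))
```

With five or six dependent sections, each would get its own pair of named groups and one comparison in the second stage.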
Alternatively, you might push some of that work down into a complex generated regex, which likely will perform worse and will be harder to debug. Your DDDMM example might be summarized as D+M+, or [DM]+, not sure if sequence matters. The complement might be C+N+, or [CN]+. Apparently there's two cases here. So start assembling a regex: stuff1 [DM]+ stuff2 [CN]+ stuff3. Then tack on '|' for alternation, and tack on the other case: stuff1 [CN]+ stuff2 [DM]+ stuff3 (or factor out suffix and prefix so alternation starts after stuff1). I can't imagine you'll be happy with such an approach, as the combinatorics get ugly, and the regex engine is forced to do lots of scanning and backtracking. And recompiling additional regexes on the fly doesn't come for free. Instead you should use the regex engine for the simple things that it's good at, and delegate complex business logic decisions to your app.

Pattern matching language knowledge, pattern matching approach

I am trying to implement a pattern matching "syntax" and language.
I know of regular expressions but these aren't enough for my purposes.
I have identified some "mathematical" operators.
In the examples that follow I will suppose that the subjects of pattern matching are character strings, but that isn't necessary.
Having read the description below, the question is: does anybody know of a mathematical theory that makes this explicit, or any language that takes the same approach to implementing it? I would like to look at it in order to get ideas!
Description of approach:
At first we have characters. Characters may be aggregated to form strings.
A pattern is:
a) a single character
b) an ordered group of patterns with the operator matchAny
c) an ordered group of patterns with the operator matchAll
d) other various operators to see later on.
Explanation:
We have a subject character string and a starting position.
If we check for a match of a single character, then if it matches it moves the current position forward by one position.
If we check for a match of an ordered group of patterns with the operator matchAny then it will check each element of the group in sequence and we will have a proliferation of starting positions that will get multiplied by the number of possible matches being advanced by the length of the match.
E.G suppose the group of patterns is { "a" "aba" "ab" "x" "dd" } and the string under examination is:
"Dabaxddc" with current position 2 ( counting from 1 ).
Then applying matchAny with the previous group we have that "a" matches, "aba" matches and "ab" matches, while "x" and "dd" do not match.
After having those matches there are 3 starting positions, 3 4 5 (corresponding to "a" "ab" "aba").
We may continue our pattern matching by accepting more than one starting position. So now we may continue to the next case under examination and check for a matchAll.
matchAll means that all patterns must match sequentially and are applied sequentially.
subcases of matchAll are match0+ match1+ etc.
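The description above can be sketched as a handful of combinators that map a set of starting positions to a set of end positions. This is only one possible reading of the approach, written in Python with 0-based positions (so position 2 counting from 1 is index 1); the function names lit, match_any and match_all are chosen for the illustration.

```python
def lit(s):
    """Pattern for a literal string: advance every start where s occurs."""
    def match(text, starts):
        return {p + len(s) for p in starts if text.startswith(s, p)}
    return match


def match_any(*patterns):
    """Try every pattern from every start; keep all resulting positions."""
    def match(text, starts):
        ends = set()
        for pattern in patterns:
            ends |= pattern(text, starts)
        return ends
    return match


def match_all(*patterns):
    """Apply the patterns one after another, threading the position set."""
    def match(text, starts):
        positions = set(starts)
        for pattern in patterns:
            positions = pattern(text, positions)
            if not positions:
                break  # no live position left, the sequence fails
        return positions
    return match


text = "Dabaxddc"
group = match_any(lit("a"), lit("aba"), lit("ab"), lit("x"), lit("dd"))

after_any = group(text, {1})                            # 0-based positions
after_all = match_all(group, lit("x"))(text, {1})

print(sorted(p + 1 for p in after_any))   # positions counting from 1
print(sorted(p + 1 for p in after_all))
```

Representing the state as a set of positions keeps the "proliferation of starting positions" explicit and avoids backtracking; match0+ and match1+ could be added as loops that apply a pattern until the set stops growing.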
I have to add that the very act of trying to ask the question has already helped me and cleared some things up for me.
But I would like to know of similar approaches in order to study them.
Please only languages used by you and not bibliography !!!
I suggest you have a look at the paper "Parsing Permutation Phrases". It deals with recognizing a set of things in any order where the "things" can be recognizers themselves. The presentation in the paper might be a little different than what you expect; they don't compile to finite automaton. However, they do give an implementation in a functional language and that should be helpful to you.
Your description of matching strings against patterns is exactly what a compiler does. In particular, your description of multiple potential matches is highly reminiscent of the way an LR parser works.
If the patterns are static and can be described by an EBNF, then you could use an LR parser generator (such as YACC) to generate a recogniser.
If the patterns are dynamic but can still be formulated as EBNF there are other tools that can be applied. It just gets a bit more complicated.
[In Australia at least, Computer Science was a University course in 1975, when I did mine. YACC dates from around 1970 in its original form. EBNF is even older.]

differentiating and testing regex variants

Several implementations of regular expressions differ from each other in subtle ways which is the source of much confusion when I try to use them.
Most of these differences include the semantics related to whether a character is escaped or not. This is most often an issue with parentheses, but can apply to curly brackets and others. This is probably a consequence of the syntax of the language or environment in which the implementation is found. For instance, if the $ symbol indicates a variable name in some language, one can expect regular expressions represented in that language would require escaping the "end of line" anchor to \$ or some such. But what gets confusing at this point is how you would represent an actual dollar sign. I believe Perl gets around this by wrapping a regex inside forward slashes /.
Similarly there are escapes for specific characters themselves, for instance non printing characters such as \n and \t. Then there are the similar looking generic character groups such as \d for digits, \s for whitespace, and \w which I just learned covers underscores as well as digits. I found myself on several occasions trying to use \a for a "alphabetical" group but this only ended up matching the bell character 0x07.
It's pretty clear that there is no simple and one-shot solution to knowing all of the differences in features and syntax offered by the myriad of implementations of regular expressions out there, short of somebody doing all the hard work and putting results in a well organized table. Here is one example of exactly this, but of course it doesn't cover several of the programs that I use extensively myself, which include vim, sed, Notepad++, Eclipse, and believe it or not MS Word (at least version 2010, I suspect 2007 also has this, they call it "wildcards") has a simple regex implementation too.
I guess what I want is to be as lazy as possible (in a certain sense) by trying to come up with a way to determine for any given regex implementation what its "escape settings" are beyond any doubt by applying one (or a few) queries.
I'm thinking I can make a file which contains test cases, along with a huge regex query, and somehow engineer it so that running it once will show me exactly what syntax I need to use subsequently without doubting myself any further. (as opposed to having to edit files and use multiple queries to figure out the same thing which gets terribly old after a while).
If nobody else has attempted to construct such a monstrosity, I may undertake this task myself. If it's even possible. Is this possible?
I tried to come up with an example (it was just to figure out if EOL anchor is $ or \$) but in every case I had to use a multitude of different search/replace queries in order to determine how the program will respond to the input.
Edit: I came up with something using capturing and backtracking. I gotta work on it a little more.
Update: Well, Notepad++ does not implement the OR operator commonly denoted by the pipe |. Word's "wildcards" is a poor substitute also, it doesn't have | or *. I'm fairly certain that missing any of the regular expression operators (union, concat, star) means it cannot generate a regular grammar, so those two are ruled out.
I can create an input file like this:
$
*
]
EOL
and query
(\$)|(\*)|(\[)|($)
replacing with
escDollar:\1:escStar:\2:escSQBrL:\3:Dollar:\4:
yields a result of (assuming unescaped parens is group and unescaped pipe is or)
escDollar:$:escStar::escSQBrL::Dollar::
escDollar::escStar:*:escSQBrL::Dollar::
]escDollar::escStar::escSQBrL::Dollar::
EOLescDollar::escStar::escSQBrL::Dollar::
I ran this in vim. This output would demonstrate the single characters that are matched by each item specified next to it, i.e. the escaped dollar sign item is seen to match the actual dollar sign character rather than the non escaped dollar sign item at the end.
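The same probe can be scripted so that the result is produced in one run. Here is a sketch using Python's re module as the implementation under test; count=1 is an assumption meant to mimic a single substitution per line, and the names PROBE, TEMPLATE and report are illustrative.

```python
import re

# The query and replacement from above, unchanged.
PROBE = r"(\$)|(\*)|(\[)|($)"
TEMPLATE = r"escDollar:\1:escStar:\2:escSQBrL:\3:Dollar:\4:"

# The four lines of the input file.
lines = ["$", "*", "]", "EOL"]

# One substitution per line; groups that did not take part in the
# match are replaced by the empty string.
report = [re.sub(PROBE, TEMPLATE, line, count=1) for line in lines]

for row in report:
    print(row)
```

In this flavor the rows come out the same as the vim result above, which suggests that unescaped parentheses group, an unescaped pipe alternates, and \$ is a literal dollar sign. A script like this could be repeated for any tool that can be driven from the command line.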
It's difficult to see what's going on with the $ anchor since it matches zero characters, but it shouldn't be hard to find a solution for it. Besides it's not a commonly mistaken one. The ones I'm particularly worried about are pipe and parens and the different brackets. When you've got 4 different types in there there are 2^4 combinations of escaped and non-escaped versions of them you can use. Trial-and-error with that is horrific.
This output isn't too hard to parse at a glance, and is also seriously easy to process as part of a script. The one glaring problem that remains is figuring out whether parens and pipe need to be escaped. Because the functionality of the whole thing depends on them.
It would seem like that will require multiple queries. It may be possible with a cleverly engineered jumble of backslashes, parens, and pipes to figure out the combination (only 4 possibilities after all) with an initial query, then choose the subsequent matrix generator query based on it.
Something like this shows it can work:
(e)
(f)
querying
\((f\))|\|\((e\))
replace with
\1:\2
would produce:
:(e if escaped parens is group and escaped pipe is or
:e) if parens is group and escaped pipe is or
(f: if escaped parens is group and pipe is or
f): if parens is group and pipe is or
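For comparison, here is the same parens-and-pipe probe run against Python's re module, again as a sketch with one substitution per line (count=1 is an assumption about how the editor applies the query). The variable names are illustrative.

```python
import re

# The probe query and replacement from above, unchanged.
PROBE = r"\((f\))|\|\((e\))"
TEMPLATE = r"\1:\2"

# The two lines of the probe input.
lines = ["(e)", "(f)"]

results = [re.sub(PROBE, TEMPLATE, line, count=1) for line in lines]

for before, after in zip(lines, results):
    print(before, "->", after)
```

Only the "(f)" line changes, and it becomes "f):", which corresponds to the last row of the table: unescaped parens group and an unescaped pipe alternates.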
I still don't really like this though because it requires a second query on a second set of input. Too much setting up. I may just make 4 copies of the "matrix" thing.
The table on this page summarizes quite nicely which features are available in which regex implementations:
http://www.regular-expressions.info/refflavors.html