Is there a way to use whitespace in BNFC?

How do you use whitespace in a BNFC definition?
For example, suppose I want to produce a parser for the lambda calculus where I allow a list of variables to be abstracted:
\x y z.x z (y z)
The "obvious" thing to do is use a labeled rule like:
ListAbs . Exp ::= "\\" [Ident] "." Exp ;
separator Ident " "
However, BNFC defaults to stripping whitespace, so that does not work. What does work is using a comma separator. A bit uglier, but I could live with it... Still it would be nice to be able to separate by space.
Is there a whitespace character class in BNFC?

You can declare the empty string as separator:
separator Ident ""
In practice this lets you use whitespace (any space characters) as the separator:
$ cat test.cf
A . A ::= [Ident] ;
separator Ident ""
$ bnfc -haskell -m test.cf
$ make
$ echo 'x y z' | ./Testtest
Parse Successful!
[Abstract Syntax]
A [Ident "x",Ident "y",Ident "z"]
[Linearized tree]
x y z
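Applied to the lambda-calculus rule from the question, that would look something like this (an untested sketch; the rest of the Exp grammar is omitted):
-- untested sketch: abstraction over a space-separated list of variables
ListAbs . Exp ::= "\\" [Ident] "." Exp ;
separator Ident "" ;
If you want to require at least one bound variable, LBNF also provides separator nonempty Ident "".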

I have written a regex for matching a substring with spaces around it, but it's not working well

I was working on a regex problem whose task is to take a substring (||, &&) and replace it with another substring (or, and). I wrote code for it, but it's not working well.
Input = x&& &&& && && x || | ||\|| x
Expected output = x&& &&& and and x or | ||\|| x
Here is the code I wrote
import re
for i in range(int(input())):
    print(re.sub(r'\s[&]{2}\s', ' and ', re.sub(r"\s[\|]{2}\s", " or ", input())))
My output = x&& &&& and && x or | ||\|| x
You need to use lookarounds. The problem with the current regex shows up in && &&: the first match consumes the surrounding spaces, so there is no space left before the second && and it won't match. That's why we need zero-length matches (lookarounds).
Replace the regex
\s[&]{2}\s --> (?<=\s)[&]{2}(?=\s)
\s[\|]{2}\s --> (?<=\s)[\|]{2}(?=\s)
(?<=\s) - the match must be preceded by a whitespace character
(?=\s) - the match must be followed by a whitespace character
You're looking for a regex like (?<=\s)&&(?=\s) (Regex demo)
Using lookarounds to assert the space characters around your targeted replacement groups, without consuming them, lets adjacent matches share the space between them; otherwise the regex consumes the spaces on both sides and blocks the neighbouring && or || from matching.
import re
# raw strings avoid invalid-escape warnings for \| and \s
in_str = r'x&& &&& && && x || | ||\|| x'
expect_str = r'x&& &&& and and x or | ||\|| x'
print(re.sub(r"(?<=\s)\|\|(?=\s)", "or", re.sub(r"(?<=\s)&&(?=\s)", "and", in_str)))
Python demo
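You can also do it in a single pass by alternating the two operators and picking the replacement in a callback; this is just a sketch, not part of the original answer:
import re
in_str = r'x&& &&& && && x || | ||\|| x'
# Match either operator only when it sits between whitespace (zero-length
# lookarounds, so adjacent matches can share a space), then choose the word.
out = re.sub(r'(?<=\s)(&&|\|\|)(?=\s)',
             lambda m: 'and' if m.group(1) == '&&' else 'or',
             in_str)
print(out)  # x&& &&& and and x or | ||\|| x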
Try using re.findall() instead of re.sub

perl6 regex: match all punctuations except . and "

I read some threads on matching "X except Y", but none specific to perl6. I am trying to match and replace all punctuation except . and "
> my $a = ';# -+$12,678,93.45 "foo" *&';
;# -+$12,678,93.45 "foo" *&
> my $b = $a.subst(/<punct - [\.\"]>/, " ", :g);
===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
------> my $b = $a.subst(/<punct⏏ - [\.\"]>/, " ", :g);
Unrecognized regex metacharacter (must be quoted to match literally)
------> my $b = $a.subst(/<punct -⏏ [\.\"]>/, " ", :g);
Unable to parse expression in metachar:sym<assert>; couldn't find final '>' (corresponding starter was at line 1)
------> my $b = $a.subst(/<punct - ⏏[\.\"]>/, " ", :g);
> my $b = $a.subst(/<punct-[\.\"]>/, " ", :g);
===SORRY!=== Error while compiling:
Unable to parse expression in metachar:sym<assert>; couldn't find final '>' (corresponding starter was at line 1)
------> my $b = $a.subst(/<punct⏏-[\.\"]>/, " ", :g);
expecting any of:
argument list
term
> my $b = $a.subst(/<punct>-<[\.\"]>/, " ", :g);
===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
------> my $b = $a.subst(/<punct>⏏-<[\.\"]>/, " ", :g);
Unable to parse regex; couldn't find final '/'
------> my $b = $a.subst(/<punct>-⏏<[\.\"]>/, " ", :g);
> my $b = $a.subst(/<- [\.\"] + punct>/, " ", :g); # $b is blank space, not what I want
> my $b = $a.subst(/<[\W] - [\.\"]>/, " ", :g);
12 678 93.45 "foo"
# this works, but clumsy; I want to
# elegantly say: punctuation except \. and \"
# using predefined class <punct>;
What is the best approach?
I think the most natural solution is to use a "character class arithmetic expression". This entails using + and - prefixes on any number of either Unicode properties or [...] character classes:
#;# -+$12,678,93.45 "foo" *&
<+:punct -[."]> # +$12 678 93.45 "foo"
This can be read as "the class of characters that have the Unicode property punct minus the . and " characters".
Your input string includes + and $. These are not considered "punctuation" characters. You could explicitly add them to the set of characters being replaced by spaces:
<:punct +[+$] -[."] > # 12 678 93.45 "foo"
(I've dropped the initial + before :punct. If you don't write a + or - for the first item in a character class arithmetic expression then + is assumed.)
There's a Unicode property that covers all "symbols" including + and $ so you could use that instead:
<:punct +:symbol -[."] > # 12 678 93.45 "foo"
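Putting that together, the complete call would look something like this (a sketch assembled from the snippets above):
my $a = ';# -+$12,678,93.45 "foo" *&';
say $a.subst(/<:punct +:symbol -[."] >/, " ", :g);   # 12 678 93.45 "foo"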
To recap, you can combine any number of:
Unicode properties like :punct that start with a : and correspond to some character property specified by Unicode; or
[...] character classes that enumerate specific characters, backslash character classes (like \d), or character ranges (eg a..z).
If an overall <...> assertion is to be a character class arithmetic expression then the first character after the opening < must be one of four characters:
: introducing a Unicode property (eg <:punct ...>);
[ introducing a [...] character class (eg <[abc ...>);
a + or a -; this may be followed by spaces, and must then be followed by either a Unicode property (:foo) or a [...] character class (eg <+ :punct ...>).
Thereafter each additional property or character class in the same overall character class arithmetic expression must be preceded by a + or - with or without additional spaces (eg <:punct - [."] ...>).
You can group sub-expressions in parentheses.
I'm not sure what the precise semantics of + and - are. I note this surprising result:
say $a.subst(/<-[."] +:punct>/, " ", :g); # substitutes ALL characters!?!
Built-ins of the form <...> are not accepted in character class arithmetic expressions.
This is true even if they're called "character classes" in the doc. That includes ones that are nothing like a character class (eg <ident> is called a character class in the doc even though it matches a whole string of characters that follows a particular pattern!) but also ones that do look like character classes, such as <punct> or <digit>. (Many of the latter correspond directly to Unicode properties, so you can just use those instead.)
To use a backslash "character class" like \d in a character class arithmetic expression using + and - arithmetic you must list it within a [...] character class.
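For example (a quick, untested sketch), wrapping \d in a [...] class lets you subtract from it:
say "102" ~~ m:g/ <[\d] - [0]> /;   # matches "1" and "2", but not "0"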
Combining assertions
While <punct> can't be combined with other assertions using character class arithmetic it can be combined with other regex constructs using the & regex conjunction operator:
<punct> & <-[."]> # +$12 678 93.45 "foo"
Depending on the state of compiler optimization (and as of 2019 there's been almost no effort applied to the regex engine), this will be slower in general than using real character classes.

Parse arbitrary delimiter character using Antlr4

I am trying to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar to Perl). How can I achieve this?
To be clear: My problem is not the regular expression itself (which I actually do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily define the following rules to the lexer:
REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;
This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).
If I had to solve this using a regular expression itself, I would use a backreference like this:
m(.)(.+?)\1
(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.
It would be even better when I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all different variants.
Can this be solved with Antlr4?
You could do something like this:
lexer grammar TLexer;
REGEX
: REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
| '{' REGEX_ATOM+ '}'
| '(' REGEX_ATOM+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [/~@#]
;
fragment REGEX_ATOM
: '\\' .
| ~[\\]
;
If you run the following class:
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));
for (Token t : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
}
}
}
you will see the following output:
REGEX /foo/
ANY
ANY /
ANY b
ANY a
ANY r
ANY \
ANY
REGEX ~\~~
ANY
REGEX {mu}
ANY
ANY (
ANY b
ANY l
ANY a
ANY (
The {...}? is called a predicate:
Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?
The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter that matches the first character (which is a REGEX_DELIMITER, of course).
Tested with ANTLR 4.5.3
EDIT
And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):
lexer grammar TLexer;
@lexer::members {
boolean delimiterAhead(String start) {
return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
}
}
REGEX
: '/' ( '\\' . | ~[/\\] )+ '/'
| 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
| 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
| 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [~@#]
;
fragment SPACES
: [ \t]+
;

replace every other space with new line

I have strings like this:
a <- "this string has an even number of words"
b <- "this string doesn't have an even number of words"
I want to replace every other space with a new line. So the output would look like this...
myfunc(a)
# "this string\nhas an\neven number\nof words"
myfunc(b)
# "this string\ndoesn't have\nan even\nnumber of\nwords"
I've accomplished this by doing a strsplit, paste-ing a newline onto the even-numbered words, and then using paste(collapse = " ") to put them back together into one string. Is there a regular expression to use with gsub that can accomplish this?
@Jota suggested a simple and concise way:
myfunc = function(x) gsub("( \\S+) ", "\\1\n", x) # Jota's
myfunc2 = function(x) gsub("([^ ]+ [^ ]+) ", "\\1\n", x) # my idea
lapply(list(a,b), myfunc)
[[1]]
[1] "this string\nhas an\neven number\nof words"
[[2]]
[1] "this string\ndoesn't have\nan even\nnumber of\nwords"
How it works: the idea of the "([^ ]+ [^ ]+) " regex is (1) "find two sequences of words/non-spaces with a space between them and a space after them" and (2) "replace the trailing space with a newline".
@Jota's "( \\S+) " is trickier -- it finds any word with a space before and after it and then replaces the trailing space with a newline. This works because the first word caught by this is the second word of the string; the next word caught is not the third (since we have already "consumed"/looked at the space in front of the third word when handling the second word), but the fourth; and so on.
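You can see this by listing the pieces that "( \\S+) " actually grabs (a quick check with base R's gregexpr/regmatches):
regmatches(a, gregexpr("( \\S+) ", a))[[1]]
# " string "  " an "  " number "   (only every second space, the trailing one in each match, becomes \n)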
Oh, and some basic regex stuff (there's a one-liner example after this list).
[^xyz] means any single char except the chars x, y, and z.
\\s is any whitespace character (space, tab, newline), while \\S is anything but whitespace
x+ means x one or more times
(x) "captures" x, allowing for reference in the replacement, like \\1

regex remove punct removes non-punctuation characters in R

While filtering and cleaning text in Hebrew, I found that
gsub("[[:punct:]]", "", txt)
actually removes a relevant character. The character is "ק" and it is located in the "E" spot on the keyboard. Interestingly, the gsub function in R removes the "ק" character and then all words get messed up. Does anyone have an idea why?
According to Regular Expressions as used in R:
Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.
According to the POSIX locale, [[:punct:]] should capture ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~. So, you might need to adjust your regex to remove only the characters you want:
txt <- "!\"#$%&'()*+,\\-./:;<=>?#[\\\\^\\]_`{|}~"
gsub("[\\\\!\"#$%&'()*+,./:;<=>?#[\\^\\]_`{|}~-]", "", txt, perl = T)
Sample program output:
[1] ""