I'm using ANTLR with Presto grammar in order to parse SQL queries.
This is the original string definition I've used to parse queries:
STRING
: '\'' ( '\\' .
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
This worked fine for most queries until I ran into queries with different escaping rules. For example:
select
table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features
from table1
So I've modified my String definition and now it looks like:
STRING
: '\'' ( '\\' .
| '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}? // match \ followed by any char
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
However, this won't work for the query mentioned above as I'm getting
'\\'',''),'
as a single string.
The predicate returns true for the query above.
Any idea how can I handle this query as well?
Thanks,
Nir.
In the end I was able to solve it. This is the expression I was using:
STRING
: '\'' ( '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}?
| '\\' (~[\\] | . {!HelperUtils.isNeedSpecialEscaping(this)}?)
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
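The difference between the original rule and the final one can be checked outside ANTLR. The following is only a rough Python sketch, not the ANTLR lexer itself: the semantic predicate is replaced by a hypothetical `special` flag, and the names `DEFAULT_STRING`, `SPECIAL_STRING` and `string_tokens` are made up for illustration. It assumes the predicate is true for the whole query.

```python
import re

# Original rule: backslash escapes any one character, '' is an escaped quote.
DEFAULT_STRING = re.compile(r"'(?:\\.|[^\\']|'')*'", re.DOTALL)

# Final rule with the predicate assumed true: two backslashes plus any one
# character form a unit; a single backslash must be followed by a
# non-backslash character.
SPECIAL_STRING = re.compile(r"'(?:\\\\.|\\[^\\]|[^\\']|'')*'", re.DOTALL)

QUERY = r"""table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features"""


def string_tokens(sql, special):
    """Return the string literals found by the chosen rule (illustrative)."""
    pattern = SPECIAL_STRING if special else DEFAULT_STRING
    return pattern.findall(sql)


if __name__ == "__main__":
    for mode in (False, True):
        print("special =", mode)
        for token in string_tokens(QUERY, special=mode):
            print("   ", token)
```

With `special=False` the first literal swallows everything up to the quote before `\"`, which is the behaviour described in the question. With `special=True` the query splits into four short literals.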
grammar Question;
sql
@init {System.out.println("Question last update 2352");}
: replace+ EOF
;
replace
: REPLACE '(' expr ')'
;
expr
: ( replace | ID ) ',' STRING ',' STRING
;
REPLACE : 'replace' DIGIT? ;
ID : [a-zA-Z0-9_]+ ;
DIGIT : [0-9] ;
STRING : '\'' '\\\\\'' '\'' // '\\''
| '\'' '\'\'' '\'' // ''''
| '\'' ~[\\']* '\'\'' ~[\\']* '\'' // 'it is 8 o''clock'
| '\'' .*? '\'' ;
NL : '\r'? '\n' -> channel(HIDDEN) ;
WS : [ \t]+ -> channel(HIDDEN) ;
File input.txt (not having more examples, I can only guess) :
replace1(replace(some_col,'\\'',''),'\"' ,'')
replace2(some_col,'''','')
replace3(some_col,'abc\tdef\tghi','xyz')
replace4(some_col,'abc\ndef','xyz')
replace5(some_col,'it is 8 o''clock','8')
Execution :
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Question*.java
$ grun Question sql -tokens input.txt
[#0,0:7='replace1',<REPLACE>,1:0]
[#1,8:8='(',<'('>,1:8]
[#2,9:15='replace',<REPLACE>,1:9]
[#3,16:16='(',<'('>,1:16]
[#4,17:24='some_col',<ID>,1:17]
[#5,25:25=',',<','>,1:25]
[#6,26:30=''\\''',<STRING>,1:26]
[#7,31:31=',',<','>,1:31]
[#8,32:33='''',<STRING>,1:32]
[#9,34:34=')',<')'>,1:34]
[#10,35:35=',',<','>,1:35]
[#11,36:39=''\"'',<STRING>,1:36]
[#12,40:40=' ',<WS>,channel=1,1:40]
[#13,41:41=',',<','>,1:41]
[#14,42:43='''',<STRING>,1:42]
[#15,44:44=')',<')'>,1:44]
[#16,45:45='\n',<NL>,channel=1,1:45]
[#17,46:53='replace2',<REPLACE>,2:0]
[#18,54:54='(',<'('>,2:8]
[#19,55:62='some_col',<ID>,2:9]
[#20,63:63=',',<','>,2:17]
[#21,64:67='''''',<STRING>,2:18]
[#22,68:68=',',<','>,2:22]
[#23,69:70='''',<STRING>,2:23]
[#24,71:71=')',<')'>,2:25]
[#25,72:72='\n',<NL>,channel=1,2:26]
[#26,73:80='replace3',<REPLACE>,3:0]
[#27,81:81='(',<'('>,3:8]
[#28,82:89='some_col',<ID>,3:9]
[#29,90:90=',',<','>,3:17]
[#30,91:105=''abc\tdef\tghi'',<STRING>,3:18]
[#31,106:106=',',<','>,3:33]
[#32,107:111=''xyz'',<STRING>,3:34]
[#33,112:112=')',<')'>,3:39]
[#34,113:113='\n',<NL>,channel=1,3:40]
[#35,114:121='replace4',<REPLACE>,4:0]
[#36,122:122='(',<'('>,4:8]
[#37,123:130='some_col',<ID>,4:9]
[#38,131:131=',',<','>,4:17]
[#39,132:141=''abc\ndef'',<STRING>,4:18]
[#40,142:142=',',<','>,4:28]
[#41,143:147=''xyz'',<STRING>,4:29]
[#42,148:148=')',<')'>,4:34]
[#43,149:149='\n',<NL>,channel=1,4:35]
[#44,150:157='replace5',<REPLACE>,5:0]
[#45,158:158='(',<'('>,5:8]
[#46,159:166='some_col',<ID>,5:9]
[#47,167:167=',',<','>,5:17]
[#48,168:185=''it is 8 o''clock'',<STRING>,5:18]
[#49,186:186=',',<','>,5:36]
[#50,187:189=''8'',<STRING>,5:37]
[#51,190:190=')',<')'>,5:40]
[#52,191:191='\n',<NL>,channel=1,5:41]
[#53,192:191='<EOF>',<EOF>,6:0]
Question last update 2352
I have a file like this:
a,b,c,"hello, hi",d
I want the field separator to be not space, comma, not space.
Currently I have
cat file | awk 'BEGIN { FS = "[^ ],[^ ]" } ; { print $4 }'
which should give "hello, hi" but it returns nothing. I'm quite new to this regular expression thing so any help would be appreciated.
Eh, no it should not give hello, hi. What actually happens is:
a,b,c,"hello, hi",d
|| ||| || ||_|Third field separator
|| ||| ||_______|
|| ||| | $3
|| |||_|
|| || Second field separator
|| ||
|| |+- $2 is a comma
||_|
| First field separator
|
+- $1 is empty
So after the third field separator, the line is empty. You can verify this behaviour with
aaa,baa,caa,"hello, hi",daa
as input-file.
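The same splitting can be reproduced with a short Python sketch. It assumes that `re.split` finds the same separators as awk does for this particular pattern (each match is exactly three characters long, so there is no longest-match ambiguity). The helper name `awk_like_fields` is made up for illustration.

```python
import re

# Same pattern as the FS in the question: non-space, comma, non-space.
FS = re.compile(r"[^ ],[^ ]")


def awk_like_fields(line):
    """Split a line on the regex separator; list index 0 corresponds to $1."""
    return FS.split(line)


if __name__ == "__main__":
    print(awk_like_fields('a,b,c,"hello, hi",d'))
    # ['', ',', 'hello, hi', '']
    print(awk_like_fields('aaa,baa,caa,"hello, hi",daa'))
    # ['aa', 'a', 'a', 'hello, hi', 'aa']
```

Each separator eats one character on either side of the comma, which is why the single-letter fields vanish and $4 ends up empty for the first line.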
If you work with CSV files regularly, consider installing csvtool; then you can simply say:
echo 'a,b,c,"hello, hi",d' | csvtool col 4 -
and it will spit out
"hello, hi"
You can also use sed:
>sed 's/.*\("[^"]*"\).*/\1/' <<< 'a,b,c,"hello, hi",d'
"hello, hi"
or grep:
>grep -o '"[^"]*"' <<< 'a,b,c,"hello, hi",d'
"hello, hi"
One solution is to define the field content instead of the field separator. You need gawk for this, because standard awk has no FPAT variable (on many Linux distributions, awk is gawk).
echo 'a,b,c,"hello, hi",d' \
| awk '
# define the field content with FPAT:
# either a run of non-comma characters or a double-quoted string
BEGIN{ FPAT = "[^,]*|\"[^\"]*\"" }
# for showing each field
{for (i=1;i<=NF;i++) printf( "field %d: %s\n", i, $i)}
'
field 1: a
field 2: b
field 3: c
field 4: "hello, hi"
field 5: d
By default, regex matching always takes the longest possible match, so the complete quoted string "hello, hi" wins over the shorter comma-delimited pieces of the same string.
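For comparison, here is a hedged Python sketch of the same "describe the field, not the separator" idea. One difference matters: Python tries alternatives left to right rather than taking the longest match, so the quoted alternative has to come first. The function names are invented for this example, and the regex variant silently drops empty fields.

```python
import csv
import re

LINE = 'a,b,c,"hello, hi",d'

# Quoted string first, then a run of non-comma characters.
FIELD = re.compile(r'"[^"]*"|[^,]+')


def fields_by_content(line):
    """Return fields matched by content; quotes are kept, empty fields lost."""
    return FIELD.findall(line)


def fields_by_csv(line):
    """Return fields using the standard csv module; quotes are stripped."""
    return next(csv.reader([line]))


if __name__ == "__main__":
    print(fields_by_content(LINE))  # ['a', 'b', 'c', '"hello, hi"', 'd']
    print(fields_by_csv(LINE))      # ['a', 'b', 'c', 'hello, hi', 'd']
```

For real CSV data the `csv` module is the safer choice, since it also handles empty fields and doubled quotes.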
I'm trying to munge a simple grammar with a perl regex (note this isn’t intended for production use, just a quick analysis for providing editor hints/completions). For instance,
my $GRAMMAR = qr{(?(DEFINE)
(?<expr> \( (?&expr) \) | (?&number) | (?&var) | (?&expr) (?&op) (?&expr) )
(?<number> \d++ )
(?<var> [a-z]++ )
(?<op> [-+*/] )
)}x;
I would like to be able to run this as
$expr =~ /$GRAMMAR(?&expr)/;
and then access all the variable names. However, according to perlre,
Note that capture groups matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing groups is necessary. Thus $+{NAME_PAT} would not be defined even though $+{NAME} would be.
So apparently this is not possible. I could try using a (?{ code }) block to save variable names to a hash, but this doesn't respect backtracking (i.e. the assignment’s side effect persists even if the variable is backtracked past).
Is there any way to get everything captured by a given named capture group, including recursive matches? Or do I need to manually dig through the individual pieces (and thus duplicate all the patterns)?
The necessity of having to add capturing and backtracking machinery is one of the shortcomings that Regexp::Grammars addresses.
However, the grammar in your question is left-recursive, which neither Perl regexes nor a recursive-descent parser will parse.
Adapting your grammar to Regexp::Grammars and factoring out left-recursion produces
my $EXPR = do {
use Regexp::Grammars;
qr{
^ <Expr> $
<rule: Expr> <Term> <ExprTail>
| <Term>
<rule: Term> <Number>
| <Var>
| \( <MATCH=Expr> \)
<rule: ExprTail> <Op> <Expr>
<token: Op> \+ | \- | \* | \/
<token: Number> \d++
<token: Var> [a-z]++
}x;
};
Note that this simple grammar gives all operators equal precedence rather than Please Excuse My Dear Aunt Sally.
You want to extract all variable names, so you could walk the AST as in
sub all_variables {
my($root,$var) = @_;
$var ||= {};
++$var->{ $root->{Var} } if exists $root->{Var};
all_variables($_, $var) for grep ref $_, values %$root;
wantarray ? keys %$var : [ keys %$var ];
}
and print the result with
if ("(a + (b - c))" =~ $EXPR) {
print "[$_]\n" for sort +all_variables \%/;
}
else {
print "no match\n";
}
Another approach is to install an autoaction for the Var rule that records names of variables as they are successfully parsed.
package JustTheVarsMaam;
sub new { bless {}, shift }
sub Var {
my($self,$result) = @_;
++$self->{VARS}{$result};
$result;
}
sub all_variables { keys %{ $_[0]->{VARS} } }
1;
Call this one as in
my $vars = JustTheVarsMaam->new;
if ("(a + (b - c))" =~ $EXPR->with_actions($vars)) {
print "[$_]\n" for sort $vars->all_variables;
}
else {
print "no match\n";
}
Either way, the output is
[a]
[b]
[c]
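The idea behind both variants (factor out the left recursion, then collect variable names while parsing) can also be sketched without any grammar module. The following Python recursive-descent parser is an illustration only, not a port of Regexp::Grammars; `tokenize` and `collect_variables` are hypothetical names. Because it never backtracks and raises on a syntax error, no names leak out of a failed parse.

```python
import re

TOKEN = re.compile(r"\s*(?:(\d+)|([a-z]+)|([-+*/])|([()]))")
KINDS = ("number", "var", "op", "paren")


def tokenize(text):
    """Turn the input into (kind, value) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append((KINDS[m.lastindex - 1], m.group(m.lastindex)))
        pos = m.end()
    return tokens


def collect_variables(text):
    """Parse Expr := Term (Op Expr)? and return the sorted variable names."""
    tokens = tokenize(text)
    names = set()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        pos += 1

    def term():
        kind, value = peek()
        if kind == "number":
            advance()
        elif kind == "var":
            names.add(value)
            advance()
        elif value == "(":
            advance()
            expr()
            if peek()[1] != ")":
                raise SyntaxError("expected ')'")
            advance()
        else:
            raise SyntaxError(f"unexpected token {value!r}")

    def expr():
        term()
        if peek()[0] == "op":  # right-recursive tail instead of left recursion
            advance()
            expr()

    expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return names and sorted(names) or []


if __name__ == "__main__":
    print(collect_variables("(a + (b - c))"))  # ['a', 'b', 'c']
```

Like the grammar above, this gives all operators equal precedence.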
Recursion is handled natively by Marpa::R2, using the BNF in the __DATA__ section below:
#!/usr/bin/env perl
use strict;
use diagnostics;
use Marpa::R2;
my $input = shift || '(a + (b - c))';
my $grammar_source = do {local $/; <DATA>};
my $recognizer = Marpa::R2::Scanless::R->new
(
{
grammar => Marpa::R2::Scanless::G->new
(
{
source => \$grammar_source,
action_object => __PACKAGE__,
}
)
},
);
my %vars = ();
sub new { return bless {}, shift;}
sub varAction { ++$vars{$_[1]}};
$recognizer->read(\$input);
$recognizer->value() || die "No parse";
print join(', ', sort keys %vars) . "\n";
__DATA__
:start ::= expr
expr ::= NUMBER
| VAR action => varAction
| expr OP expr
| '(' expr ')'
NUMBER ~ [\d]+
VAR ~ [a-z]+
OP ~ [-+*/]
WS ~ [\s]+
:discard ~ WS
The output is:
a, b, c
Your question only asked how to get the variable names, so this answer does not deal with operator associativity and the like. Just note that Marpa has no problem with that, if needed.
I'm trying to modify a back reference in PowerShell but am having no luck :(
This is my example:
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"`$2`".ToUpper()) | `$1 |"
If I run it I get this:
| Jane Doe | 456 |
But I'm really expecting this:
| JANE DOE | 456 |
If I run the following (the same as above but without the '()' on the call to ToUpper):
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"`$2`".ToUpper) | `$1 |"
I get this:
| string ToUpper(), string ToUpper(System.Globalization.CultureInfo culture) | 456 |
So it would appear that PowerShell knows that the back reference '$2' is a string but why can't I get PowerShell to convert it to upper case?
Terry
[Regex]::Replace('456,Jane Doe',
'^(\d{3}),(.*)$',
{
param($m)
'| ' + $m.Groups[2].Value.ToUpper() + ' | ' + $m.Groups[1].Value + ' |'
}
)
Not very pretty, I admit. And sadly you cannot use a script block as the replacement in the -replace operator, at least not in Windows PowerShell (PowerShell 7 does accept one).
Just to explain what is happening: in "| $(`"`$2`".ToUpper()) | `$1 |", PowerShell evaluates the $(...) subexpression before passing the string to the -replace operator, rather than after the replace operation has occurred.
In other words, ToUpper is called on the string value $2, resulting in | $2 | $1 | being used for the replace operation. You can see this by including a letter in the subexpression string, for example:
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"zz `$2`".ToUpper()) | `$1 |"
This has an effective replace string of | ZZ $2 | $1 |, giving | ZZ Jane Doe | 456 | as the result.
Similarly, the second version omitting the parentheses, "| $(`"`$2`".ToUpper) | `$1 |", is evaluated as "some string".ToUpper, which puts the array of overload definitions for the ToUpper method on System.String in the replace string.
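The evaluation-order issue is not specific to PowerShell. As a rough analogy only, the Python sketch below shows the same pitfall (upper-casing the replacement template before the regex engine ever sees a match) next to a callback that plays the role of the MatchEvaluator. The names `PATTERN`, `eager_replace` and `upper_name` are made up for this example.

```python
import re

PATTERN = re.compile(r"^(\d{3}),(.*)$")
TEXT = "456,Jane Doe"


def eager_replace(text):
    """Pitfall: upper() runs on the template, not on the matched name."""
    template = r"| zz \2 | \1 |".upper()  # becomes "| ZZ \2 | \1 |"
    return PATTERN.sub(template, text)


def upper_name(match):
    """Callback invoked per match, so upper() sees the captured text."""
    return f"| {match.group(2).upper()} | {match.group(1)} |"


if __name__ == "__main__":
    print(eager_replace(TEXT))             # | ZZ Jane Doe | 456 |
    print(PATTERN.sub(upper_name, TEXT))   # | JANE DOE | 456 |
```

Only the literal letters in the template are upper-cased in the first call; the captured name is substituted afterwards and keeps its original case.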
To keep the replace operation as a one-liner, Joey's answer using the MatchEvaluator overload to Regex.Replace works well. Or you might do the string formatting yourself based on the results of a -match:
if( '456,Jane Doe' -match '^(\d{3}),(.*)$' ) {
'| {0} | {1} |' -f $matches[2].ToUpper(),$matches[1]
}
If this needs to be replaced in the context of a larger string, you can always do a literal replace to get the final result:
PS> $r = '| {0} | {1} |' -f $matches[2].ToUpper(),$matches[1]
PS> 'A longer string with 456,Jane Doe in it.'.Replace( $matches[0], $r )
A longer string with | JANE DOE | 456 | in it.