Getting Regex error on VB.NET on a specific string - regex

Can someone help me, please? I want to understand what could make this regex match hang before it reports whether there is a match.
I'm using Regex, and it looks like I'm stuck in an infinite loop.
Here are the pattern and the string to match. No match should be found in this case; I verified that manually.
pattern = "((^)|([A-z0-9_]*[A-z][A-z0-9_]*\((([A-z0-9_]*[A-z][A-z0-9_]*\.[A-z0-9_]*[A-z][A-z0-9_]*)|([0-9]+(\.[0-9]+)?))(\ *[\+\-\*\/]\ *(([A-z0-9_]*[A-z][A-z0-9_]*\.[A-z0-9_]*[A-z][A-z0-9_]*)|([0-9]+(\.[0-9]+)?)))*\)\ *[\+\-\*\/]\ *)+)(\()*((([0-9]+(\.[0-9]+)?)\ *[\+\-\*\/]\ *)*([A-z0-9_]*[A-z][A-z0-9_]*\.[A-z0-9_]*[A-z][A-z0-9_]*))"
string = "(SUM(UTRANCELL.PMNORABESTABLISHATTEMPTPACKETSTREAM)+SUM(UTRANCELL.PMNORABESTABLISHATTEMPTPACKETSTREAM128)+SUM(UTRANCELL.PMNORABESTABLISHATTEMPTPACKETINTERACTIVE)+SUM(UTRANCELL.PMNORABESTATTEMPTSSTREAMHS))"
Here's the grammar I used to build the regex:
<formula_com_erro> ::= ( <inicio_linha> | (<subformula> ['+'|'-'|'*'|'/'])+ ) {'('} ( { (<constante_numerica>) ('+'|'-'|'*'|'/') } ( <alias> '.' <contador> ) )
<formula_com_erro> ::= ((^)|([A-z0-9_]*[A-z][A-z0-9_]*\((([A-z0-9_]*[A-z][A-z0-9_]*\.[A-z0-9_]*[A-z][A-z0-9_]*)|([0-9]+(\.[0-9]+)?))(\ *[\+\-\*\/]\ *(([A-z0-9_]*[A-z][A-z0-9_]*\.[A-z0-9_]*[A-z][A-z0-9_]*)|([0-9]+(\.[0-9]+)?)))*\)\ *[\+\-\*\/]\ *)+)(\()*((([0-9]+(\.[0-9]+)?)\ *[\+\-\*\/]\ *)*([A-z0-9_]*[A-z][A-z0-9_]*\.[A-z0-9_]*[A-z][A-z0-9_]*))
<subformula> ::= <nome_funcao> '(' (<alias> '.' <contador>) | (<constante_numerica>) {['+'|'-'|'*'|'/'] ((<alias> '.' <contador>) | (<constante_numerica>))} ')'
<subformula> ::= [A-z0-9_]*[A-z][A-z0-9_]*\((([A-z0-9_]*[A-z][A-z0-9_]*\.[A-z0-9_]*[A-z][A-z0-9_]*)|([0-9]+(\.[0-9]+)?))(\ *[\+\-\*\/]\ *(([A-z0-9_]*[A-z][A-z0-9_]*\.[A-z0-9_]*[A-z][A-z0-9_]*)|([0-9]+(\.[0-9]+)?)))*\)
<nome_funcao> ::= ( <letra> | <digito> | '_' )+
<nome_funcao> ::= [A-z0-9_]*[A-z][A-z0-9_]*
<alias> ::= ( <letra> | <digito> | '_' )+
<alias> ::= [A-z0-9_]*[A-z][A-z0-9_]*
<contador> ::= ( <letra> | <digito> | '_' )+
<contador> ::= [A-z0-9_]*[A-z][A-z0-9_]*
<constante_numerica> ::= ( <digito> )+
<constante_numerica> ::= [0-9]+(\.[0-9]+)?
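A plausible diagnosis, offered as an assumption since nothing in the question confirms it, is catastrophic backtracking rather than a true infinite loop. The identifier sub-pattern [A-z0-9_]*[A-z][A-z0-9_]* can match a run of n letters in n different ways (any letter can play the mandatory [A-z]), and when the overall match fails the engine may try the combinations across every identifier. The Python sketch below counts those ways and checks that a form pinning the mandatory letter to the first letter accepts the same names on a few samples. The names AMBIGUOUS, PINNED and count_splits are illustrative only. It also shows that [A-z] covers the punctuation between Z and a, which is probably not intended.

```python
import re

# Identifier sub-pattern from the question, with A-z narrowed to A-Za-z.
AMBIGUOUS = r"[A-Za-z0-9_]*[A-Za-z][A-Za-z0-9_]*"
# Same set of names, but the mandatory letter is pinned to the first letter,
# so there is only one way to match a given name.
PINNED = r"[0-9_]*[A-Za-z][A-Za-z0-9_]*"


def count_splits(name):
    """Count the ways AMBIGUOUS can match all of `name`.

    Each ASCII letter in the name can serve as the mandatory [A-Za-z],
    so the count is simply the number of letters.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        return 0
    return sum(1 for ch in name if ch.isascii() and ch.isalpha())


# [A-z] is a code point range, so it also accepts [ \ ] ^ _ and the backtick.
extra = [ch for ch in "[\\]^_`" if re.fullmatch(r"[A-z]", ch)]
print("extra characters accepted by [A-z]:", extra)

# Both forms accept and reject the same sample names.
for sample in ["SUM", "UTRANCELL", "1x", "_", "123", "a_b"]:
    same = bool(re.fullmatch(AMBIGUOUS, sample)) == bool(re.fullmatch(PINNED, sample))
    print(sample, "splits:", count_splits(sample), "same verdict:", same)
```

If this diagnosis applies, substituting the pinned form for each identifier and setting a match timeout on the .NET Regex object would be the first things to try.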

Related

Handling different escaping sequences?

I'm using ANTLR with Presto grammar in order to parse SQL queries.
This is the original string definition I've used to parse queries:
STRING
: '\'' ( '\\' .
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
This worked ok for most queries until I saw queries with different escaping rules. For example:
select
table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features
from table1
So I've modified my String definition and now it looks like:
STRING
: '\'' ( '\\' .
| '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}? // match \ followed by any char
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
However, this won't work for the query mentioned above as I'm getting
'\\'',''),'
as a single string.
The predicate returns True for the following query.
Any idea how I can handle this query as well?
Thanks,
Nir.
In the end I was able to solve it. This is the expression I was using:
STRING
: '\'' ( '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}?
| '\\' (~[\\] | . {!HelperUtils.isNeedSpecialEscaping(this)}?)
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
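The effect of the predicate can be imitated outside ANTLR with two plain regular expressions, one per predicate outcome. This is only a sketch: the special_escaping flag is a hypothetical stand-in for HelperUtils.isNeedSpecialEscaping, and Python's backtracking regex engine is not the ANTLR lexer, although on the query from the question both appear to pick the same tokens.

```python
import re

# Predicate false: a backslash escapes any single character (the original rule).
STANDARD = re.compile(r"""'(?:\\.|[^\\']|'')*'""")

# Predicate true: two backslashes plus one character form a unit, and a single
# backslash may only be followed by a non-backslash character.
SPECIAL = re.compile(r"""'(?:\\\\.|\\[^\\]|[^\\']|'')*'""")


def string_tokens(sql, special_escaping):
    """Return the string literals found in `sql` under the chosen escaping rule."""
    pattern = SPECIAL if special_escaping else STANDARD
    return pattern.findall(sql)


query = r"""replace(replace(some_col,'\\'',''),'\"' ,'')"""

# With the original rule the first literal swallows the following arguments.
print(string_tokens(query, special_escaping=False))
# With the special rule the four literals come out separately.
print(string_tokens(query, special_escaping=True))
```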
grammar Question;
sql
@init {System.out.println("Question last update 2352");}
: replace+ EOF
;
replace
: REPLACE '(' expr ')'
;
expr
: ( replace | ID ) ',' STRING ',' STRING
;
REPLACE : 'replace' DIGIT? ;
ID : [a-zA-Z0-9_]+ ;
DIGIT : [0-9] ;
STRING : '\'' '\\\\\'' '\'' // '\\''
| '\'' '\'\'' '\'' // ''''
| '\'' ~[\\']* '\'\'' ~[\\']* '\'' // 'it is 8 o''clock'
| '\'' .*? '\'' ;
NL : '\r'? '\n' -> channel(HIDDEN) ;
WS : [ \t]+ -> channel(HIDDEN) ;
File input.txt (not having more examples, I can only guess) :
replace1(replace(some_col,'\\'',''),'\"' ,'')
replace2(some_col,'''','')
replace3(some_col,'abc\tdef\tghi','xyz')
replace4(some_col,'abc\ndef','xyz')
replace5(some_col,'it is 8 o''clock','8')
Execution :
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Question*.java
$ grun Question sql -tokens input.txt
[@0,0:7='replace1',<REPLACE>,1:0]
[@1,8:8='(',<'('>,1:8]
[@2,9:15='replace',<REPLACE>,1:9]
[@3,16:16='(',<'('>,1:16]
[@4,17:24='some_col',<ID>,1:17]
[@5,25:25=',',<','>,1:25]
[@6,26:30=''\\''',<STRING>,1:26]
[@7,31:31=',',<','>,1:31]
[@8,32:33='''',<STRING>,1:32]
[@9,34:34=')',<')'>,1:34]
[@10,35:35=',',<','>,1:35]
[@11,36:39=''\"'',<STRING>,1:36]
[@12,40:40=' ',<WS>,channel=1,1:40]
[@13,41:41=',',<','>,1:41]
[@14,42:43='''',<STRING>,1:42]
[@15,44:44=')',<')'>,1:44]
[@16,45:45='\n',<NL>,channel=1,1:45]
[@17,46:53='replace2',<REPLACE>,2:0]
[@18,54:54='(',<'('>,2:8]
[@19,55:62='some_col',<ID>,2:9]
[@20,63:63=',',<','>,2:17]
[@21,64:67='''''',<STRING>,2:18]
[@22,68:68=',',<','>,2:22]
[@23,69:70='''',<STRING>,2:23]
[@24,71:71=')',<')'>,2:25]
[@25,72:72='\n',<NL>,channel=1,2:26]
[@26,73:80='replace3',<REPLACE>,3:0]
[@27,81:81='(',<'('>,3:8]
[@28,82:89='some_col',<ID>,3:9]
[@29,90:90=',',<','>,3:17]
[@30,91:105=''abc\tdef\tghi'',<STRING>,3:18]
[@31,106:106=',',<','>,3:33]
[@32,107:111=''xyz'',<STRING>,3:34]
[@33,112:112=')',<')'>,3:39]
[@34,113:113='\n',<NL>,channel=1,3:40]
[@35,114:121='replace4',<REPLACE>,4:0]
[@36,122:122='(',<'('>,4:8]
[@37,123:130='some_col',<ID>,4:9]
[@38,131:131=',',<','>,4:17]
[@39,132:141=''abc\ndef'',<STRING>,4:18]
[@40,142:142=',',<','>,4:28]
[@41,143:147=''xyz'',<STRING>,4:29]
[@42,148:148=')',<')'>,4:34]
[@43,149:149='\n',<NL>,channel=1,4:35]
[@44,150:157='replace5',<REPLACE>,5:0]
[@45,158:158='(',<'('>,5:8]
[@46,159:166='some_col',<ID>,5:9]
[@47,167:167=',',<','>,5:17]
[@48,168:185=''it is 8 o''clock'',<STRING>,5:18]
[@49,186:186=',',<','>,5:36]
[@50,187:189=''8'',<STRING>,5:37]
[@51,190:190=')',<')'>,5:40]
[@52,191:191='\n',<NL>,channel=1,5:41]
[@53,192:191='<EOF>',<EOF>,6:0]
Question last update 2352

How to parse/identify a double-quoted string in a larger expression using Marpa::R2 in Perl

I have a problem parsing/identifying a double-quoted string inside a larger expression.
use strict;
use Marpa::R2;
use Data::Dumper;
my $grammar = Marpa::R2::Scanless::G->new({
default_action => '[values]',
source => \(<<'END_OF_SOURCE'),
:start ::= expression
expression ::= expression OP expression
expression ::= expression COMMA expression
expression ::= func LPAREN PARAM RPAREN
expression ::= PARAM
PARAM ::= STRING | REGEX_STRING
:discard ~ sp
sp ~ [\s]+
COMMA ~ [,]
STRING ~ [^ \/\(\),&:\"~]+
REGEX_STRING ~ yet to identify
OP ~ ' - ' | '&'
LPAREN ~ '('
RPAREN ~ ')'
func ~ 'func'
END_OF_SOURCE
});
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
my $input1 = "func(foo)&func(bar)"; # parses properly: foo and bar are STRING lexemes
my $input2 = "\"foo\""; # foo should be a REGEX_STRING lexeme (anything enclosed in double quotes)
my $input3 = "func(\"foo\") - func(\"bar\")"; # func -> func lexeme, ( -> LPAREN, ) -> RPAREN, foo -> REGEX_STRING, - -> OP; same for func(\"bar\")
my $input4 = "func(\"foo\")"; # func -> func lexeme, ( -> LPAREN, ) -> RPAREN, foo -> REGEX_STRING
my $input = $input4; # pick the input under test (was undeclared in the original)
print "Trying to parse:\n$input\n\n";
$recce->read(\$input);
my $value_ref = ${$recce->value};
print "Output:\n".Dumper($value_ref);
What I tried:
1st method:
My REGEX_STRING should be something like: REGEX_STRING ~ '\"([^:]*?)\"'
If I put the above REGEX_STRING in the code, with the input expression my $input4 = "func(\"foo\")";, I get an error like:
Error in SLIF parse: No lexeme found at line 1, column 5
* String before error: func(
* The error was at line 1, column 5, and at character 0x0022 '"', ...
* here: "foo")
Marpa::R2 exception
2nd method:
I tried including a rule like:
PARAM ::= STRING | REGEX_STRING
REGEX_STRING ::= '"' QUOTED_STRING '"'
STRING ~ [^ \/\(\),&:\"~]+
QUOTED_STRING ~ [^ ,&:\"~]+
The problem here is that the input is given as:
my $input4 = "func(\"foo\")";
So it gives an error, because there are now two ways to parse this expression: either the whole thing between the double quotes, func(\"foo\"), is taken as QUOTED_STRING, or func is taken as the func lexeme, and so on.
Please help me fix this.
use 5.026;
use strictures;
use Data::Dumper qw(Dumper);
use Marpa::R2 qw();
my $grammar = Marpa::R2::Scanless::G->new({
bless_package => 'parsetree',
source => \<<'',
:default ::= action => [values] bless => ::lhs
lexeme default = bless => ::name latm => 1
:start ::= expression
expression ::= expression OP expression
expression ::= expression COMMA expression
expression ::= func LPAREN PARAM RPAREN
expression ::= PARAM
PARAM ::= STRING | REGEXSTRING
:discard ~ sp
sp ~ [\s]+
COMMA ~ [,]
STRING ~ [^ \/\(\),&:\"~]+
REGEXSTRING ::= '"' QUOTEDSTRING '"'
QUOTEDSTRING ~ [^ ,&:\"~]+
OP ~ ' - ' | '&'
LPAREN ~ '('
RPAREN ~ ')'
func ~ 'func'
});
# say $grammar->show_rules;
for my $input (
'func(foo)&func(bar)', '"foo"', 'func("foo") - func("bar")', 'func("foo")'
) {
my $r = Marpa::R2::Scanless::R->new({
grammar => $grammar,
# trace_terminals => 1
});
$r->read(\$input);
say "# $input";
say Dumper $r->value;
}
The 2nd method posted in the question worked for me. I just had to include:
lexeme default = latm => 1
in my code.
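For comparison, the token classes the question asks for can be sketched without Marpa as an ordered-alternative regex tokenizer in Python. This is a different technique from Marpa's longest acceptable token matching: here the order of the alternatives decides ties, not the parser. The names TOKEN_SPEC and tokenize are illustrative, and the lookahead that makes func a keyword only before an opening parenthesis is an assumption of this sketch.

```python
import re

# Order matters: earlier alternatives win when several could match.
TOKEN_SPEC = [
    ("REGEX_STRING", r'"[^"]*"'),        # anything enclosed in double quotes
    ("OP", r" - |&"),                    # ' - ' must be tried before whitespace
    ("FUNC", r"func(?=\()"),             # assumption: keyword only before '('
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("COMMA", r","),
    ("SP", r"\s+"),                      # discarded, like ':discard ~ sp'
    ("STRING", r'[^ /(),&:"~]+'),        # same character class as the question
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Split `text` into (token_class, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"no lexeme found at column {pos + 1}")
        if match.lastgroup != "SP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for sample in ['func(foo)&func(bar)', '"foo"', 'func("foo") - func("bar")', 'func("foo")']:
    print(sample, "->", tokenize(sample))
```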

How can I extract which of a or b matched in a Perl regex like /(a|b)/

I am trying to match different logical expressions, such as "$a and $b", using a Perl regex. Here is my code:
$input =~ /^(.*)\s(and|or|==|<|>|>=|<=)\s(.*)$/ {
$arg1=$1;
$arg2=$3;
$opt=$2;
}
My goal is to get:
$arg1="$ARGV[0]=~/\w{4}/"
$arg2="$num_arg==1"
$opt ="and"
I want to get the exact alternative that matched inside the alternation group. I don't want to match each case one by one and hardcode the operator.
Does anyone know how to solve the problem?
This code works for me:
$input = '$ARGV[0]=~/\w{4}/ and $num_arg==1';
if ($input=~/^(.*)\s(and|or|==|<|>|>=|<=)\s(.*)$/) {
$arg1=$1;
$arg2=$3;
$opt=$2;
print "$arg1\n$arg2\n$opt\n";
}
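The same idea carries over to other regex engines: the alternation sits in its own capture group, so that group holds whichever operator matched. A minimal Python sketch using the same pattern follows; the names LOGIC and split_logic are illustrative.

```python
import re

# Group 2 is the alternation, so it captures whichever operator matched.
LOGIC = re.compile(r"^(.*)\s(and|or|==|<|>|>=|<=)\s(.*)$")


def split_logic(expression):
    """Return (arg1, operator, arg2), or None if no operator is found."""
    match = LOGIC.match(expression)
    if match is None:
        return None
    arg1, opt, arg2 = match.groups()
    return arg1, opt, arg2


result = split_logic(r"$ARGV[0]=~/\w{4}/ and $num_arg==1")
print(result)
```

Because the first group is greedy, an input with several whitespace-delimited operators is split at the last one.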
You need a small parser that can reveal the structure of a logical expression, because a term may itself contain another expression. You can test your grammar in Perl with the Marpa::R2 package.
As a first attempt I would write:
<expression> ::= <term> | <expression> <binary-op> <term>
<term> ::= <factor> <binary-op> <factor> | <unary-op><factor>
<factor> ::= <id>
<binary-op> ::= (and|or|==|<|>|>=|<=)
<unary-op> ::= (not | ! )
One thing is for sure: you can't completely describe the syntax of logical expressions using only regular expressions; some valid case will always be missing.
The Perl code for validation:
use Modern::Perl;
use Marpa::R2;
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Expression ::= Term
| Expression BinaryOP Term
Term ::= Factor BinaryOP Factor
| UnaryOP Factor
Factor ::= ID
ID ~ [\w]+
BinaryOP ~ 'and' | 'or' | '==' | '<' | '>' | '>=' | '<='
UnaryOP ~ 'not' | '!'
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL
# your input
my $input = 'a and b or !c';
# your parser
my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
# process input
my $recce = Marpa::R2::Scanless::R->new(
{ grammar => $grammar, semantics_package => 'My_Actions' } );
my $length_read = $recce->read( \$input );
die "Read ended after $length_read of ", length $input, " characters"
if $length_read != length $input;

How to find the next unbalanced brace?

The regex below captures everything up to the last balanced }.
Now, what regex would be able to capture everything up to the next unbalanced }? In other words, how can I get ... {three {four}} five} from $str instead of just ... {three {four}}?
my $str = "one two {three {four}} five} six";
if ( $str =~ /
(
.*?
{
(?> [^{}] | (?-1) )+
}
)
/sx
)
{
print "$1\n";
}
So you want to match
[noncurlies [block noncurlies [...]]] "}"
where a block is
"{" [noncurlies [block noncurlies [...]]] "}"
As a grammar:
start : text "}"
text : noncurly* ( block noncurly* )*
block : "{" text "}"
noncurly : /[^{}]/
As a regex (5.10+):
/
^
(
(
[^{}]*
(?:
\{ (?-1) \}
[^{}]*
)*
)
\}
)
/x
As a regex with named subpatterns (5.10+):
/
^ ( (?&TEXT) \} )
(?(DEFINE)
(?<TEXT> [^{}]* (?: (?&BLOCK) [^{}]* )* )
(?<BLOCK> \{ (?&TEXT) \} )
)
/x
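Python's built-in re module has no recursive subpatterns, so the regexes above cannot be ported directly. A swapped-in technique that gives the same result on the example is a simple depth counter, sketched below; the function name up_to_unbalanced_close is illustrative.

```python
def up_to_unbalanced_close(text):
    """Return the prefix of `text` ending at the first unbalanced '}'.

    Returns None if every '}' in the text is balanced by an earlier '{'.
    """
    depth = 0
    for index, char in enumerate(text):
        if char == "{":
            depth += 1
        elif char == "}":
            if depth == 0:
                # This brace closes nothing that was opened inside the text.
                return text[: index + 1]
            depth -= 1
    return None


print(up_to_unbalanced_close("one two {three {four}} five} six"))
```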

Regex to split HTML tags

I have an HTML string like so:
<img src="http://foo"><img src="http://bar">
What would be the regex pattern to split this into two separate img tags?
How sure are you that your string is exactly that? What about input like this:
<img alt=">" src="http://foo" >
<img src='http://bar' alt='<' >
What programming language is this? Is there some reason you're not using a standard HTML-parsing class to handle this? Regexes are only a good approach when you have an extremely well-known set of inputs. They don't work for real HTML, only for rigged demos.
Even if you must use a regex, you should use a proper grammatical one. This is quite easy. I've tested the following programacita on a zillion web pages. It takes care of the cases I outline above — and one or two others, too.
#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;
my $img_rx = qr{
# save capture in $+{TAG} variable
(?<TAG> (?&image_tag) )
# remainder is pure declaration
(?(DEFINE)
(?<image_tag>
(?&start_tag)
(?&might_white)
(?&attributes)
(?&might_white)
(?&end_tag)
)
(?<attributes>
(?:
(?&might_white)
(?&one_attribute)
) *
)
(?<one_attribute>
\b
(?&legal_attribute)
(?&might_white) = (?&might_white)
(?:
(?&quoted_value)
| (?&unquoted_value)
)
)
(?<legal_attribute>
(?: (?&required_attribute)
| (?&optional_attribute)
| (?&standard_attribute)
| (?&event_attribute)
# for LEGAL parse only, comment out next line
| (?&illegal_attribute)
)
)
(?<illegal_attribute> \b \w+ \b )
(?<required_attribute>
alt
| src
)
(?<optional_attribute>
(?&permitted_attribute)
| (?&deprecated_attribute)
)
# NB: The white space in string literals
# below DOES NOT COUNT! It's just
# there for legibility.
(?<permitted_attribute>
height
| is map
| long desc
| use map
| width
)
(?<deprecated_attribute>
align
| border
| hspace
| vspace
)
(?<standard_attribute>
class
| dir
| id
| style
| title
| xml:lang
)
(?<event_attribute>
on abort
| on click
| on dbl click
| on mouse down
| on mouse out
| on key down
| on key press
| on key up
)
(?<unquoted_value>
(?&unwhite_chunk)
)
(?<quoted_value>
(?<quote> ["'] )
(?: (?! \k<quote> ) . ) *
\k<quote>
)
(?<unwhite_chunk>
(?:
# (?! [<>'"] )
(?! > )
\S
) +
)
(?<might_white> \s * )
(?<start_tag>
< (?&might_white)
img
\b
)
(?<end_tag>
(?&html_end_tag)
| (?&xhtml_end_tag)
)
(?<html_end_tag> > )
(?<xhtml_end_tag> / > )
)
}six;
$/ = undef;
$_ = <>; # read all input
# strip stuff we aren't supposed to look at
s{ <! DOCTYPE .*? > }{}sx;
s{ <! \[ CDATA \[ .*? \]\] > }{}gsx;
s{ <script> .*? </script> }{}gsix;
s{ <!-- .*? --> }{}gsx;
my $count = 0;
while (/$img_rx/g) {
printf "Match %d at %d: %s\n",
++$count, pos(), $+{TAG};
}
There you go. Nothing to it!
Gee, why would you ever want to use an HTML-parsing class, given how easily HTML can be dealt with in a regex. ☺
Don't do it with regex. Use an HTML/XML parser. You can even run it through Tidy first to clean it up. Most languages have a Tidy library. What language are you using?
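As one illustration of the parser route, here is a minimal Python sketch using the standard library's html.parser. The class name ImgCollector and the helper split_img_tags are illustrative, and Python is an assumption here since the question does not name a language.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the raw source text of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; self-closing <img/> also lands here
        # through the default handle_startendtag.
        if tag == "img":
            self.tags.append(self.get_starttag_text())


def split_img_tags(html):
    """Return the img tags found in `html`, in document order."""
    collector = ImgCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


print(split_img_tags('<img src="http://foo"><img src="http://bar">'))
```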
This will do it:
<img\s+src=\"[^\"]*?\">
Or you can do this to account for any additional attributes
<img\s+[^>]*?\bsrc=\"[^\"]*?\"[^>]*>
<img src=\"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?\">
PHP example:
$prom = '<img src="http://foo"><img src="http://bar">';
preg_match_all('|<img src=\"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?\">|',$prom, $matches);
print_r($matches[0]);
One slightly insane/brilliant/weird way to do it would be to split on >< and then add the two characters back respectively to the string after the split.
$string = '<img src="http://foo"><img src="http://bar">';
$KimKardashian = explode("><",$string); // split() was removed in PHP 7; explode() splits on a literal string
$First = $KimKardashian[0] . '>';
$Second = '<' . $KimKardashian[1];