Logical condition / math expression validation - regex

I want to validate a string, that is containing an expression like
isFun && ( isHelpful || isUseful )
These expressions can contain operands, binary operators and unary operators:
my $unary_operator = qr/(?:\+|\-|!|not)/;
my $binary_operator = qr/(?:<|>|<=|>=|==|\!\=|<=>|\~\~|\&|\||\^|\&\&|\|\||lt|gt|le|ge|and|or|xor|eq|ne|cmp)/i;
my $operand = qr/[a-z0-9_]+/i;
They are kind of similiar to what you know from perl's condition pattern (for example inside of if statements). The brackets have to be balanced and unary operators can only be used once in a row.
I would like to find a perl compatible regular expression, that makes sure to have a valid logical / mathmatical expression where only the given operators are used and the operands are matching the regex that is given by $operand. With recursions, this could be possible in perl. The statement is in infix notation.
My current solution is to parse a tree and perform some iterations, but I want to compress this algorithm into a single regular expression.
For my first try (I still excluded all verbose operands), I used
my $re =
qr{
( # 1
( # 2 operands ...
$operand
)
|
( # 3 unary operators with recursion for operand
(?:$unary_operator(?!\s*$unary_operator)\s*(?1))
)
|
( # 4 balance brackets
\(
\s*(?1)\s*
\)
)
|
( # 5 binary operators with recursion for each operand
(?1)\s*$binary_operator\s*(?1)
)
)
}x;
...which ends up in infinite recursion. I think the recursion might be caused in using the first (?1) in parenthesis 5.
Is somebody out there with a working solution?

The following should work for validating your expressions:
use strict;
use warnings;
my $unary_operator = qr/(?:\+|\-|!|not)/;
my $binary_operator = qr/(?:<|>|<=|>=|==|\!\=|<=>|\~\~|\&|\||\^|\&\&|\|\||lt|gt|le|ge|and|or|xor|eq|ne|cmp)/i;
my $operand = qr/[a-z0-9_]+/i;
my $re =
qr{^
( # 1
\s*
(?> (?:$unary_operator \s*)* )
(?:
\b$operand\b
|
\( (?1) \)
)
(?:
\s*
$binary_operator
\s*
(?1)
)*
\s*
)
$}x;
while (<DATA>) {
chomp;
my ($expr, $status) = split "#", $_, 2;
if ($expr =~ $re) {
print "good $_\n";
} else {
print "bad $_\n";
}
}
__DATA__
isFun # Good
isFun && isHelpful # Good
isFun && isHelpful || isUseful # Good
isFun && ( isHelpful || isUseful ) # Good
isFun && ( isHelpful || (isUseful) ) # Good
isFun && ( isHelpful || (isUseful ) # Fail - Missing )
not (isFun && (isHelpful || (isBad && isDangerous))) # Good
isGenuine!isReal # Fail
!isGenuine isReal # Fail
Outputs:
good isFun # Good
good isFun && isHelpful # Good
good isFun && isHelpful || isUseful # Good
good isFun && ( isHelpful || isUseful ) # Good
good isFun && ( isHelpful || (isUseful) ) # Good
bad isFun && ( isHelpful || (isUseful ) # Fail - Missing )
good not (isFun && (isHelpful || (isBad && isDangerous))) # Good
bad isGenuine!isReal # Fail
bad !isGenuine isReal # Fail

Related

Perl conditional regex with inequalities

I have one query. I have to match 2 strings in one if condition:
$release = 5.x (Here x should be greater than or equal to 3)
$version = Rx (this x should be greater than or equal to 5 if $release is 5.3, otherwise anything is acceptable)
e.g. 5.1R11 is not acceptable, 5.3R4 is not, 5.3R5 is acceptable, and 5.4 R1 is acceptable.
I have written a code like this:
$release = "5.2";
$version = "R4";
if ( $release =~ /5.(?>=3)(\d)/ && $version =~ m/R(?>=5)(\d)/ )
{
print "OK";
}
How can I write this?
This is really a three-level version string, and I suggest that you use Perl's version facility
use strict;
use warnings 'all';
use feature 'say';
use version;
my $release = '5.2';
my $version = 'R4';
if ( $version =~ /R(\d+)/ && version->parse("$release.$1") ge v5.3.5 ) {
say 'OK';
}
In regex (?>) it means atomic grouping.
Group the element so it will stored into $1 then compare the $1 with number so it should be
if (( ($release =~ /5\.(\d)/) && ($1 > 3) ) && (($version =~ m/R(\d)/) && ($1 >= 3) ) )
{
print "OK\n";
}
I got the correct one after modifying mkhun's solution:
if ((($release =~ /5.3/)) && (($version =~ m/R(\d+)(.\d+)?/) && ($1 >= 5))
|| ((($release =~ /5.(\d)/) && ($1 > 3)) && ($version =~ m/R(\d+)(.\d+)?/)) )
{
print "OK\n";
}

How to remove strings which do not start or end with specific substring?

Unfortunately, I'm not a regex expert, so I need a little help.
I'm looking for the solution how to grep an array of strings to get two lists of strings which do not start (1) or end (2) with the specific substring.
Let's assume we have an array with strings matching to the following rule:
[speakerId]-[phrase]-[id].txt
i.e.
10-phraseone-10.txt 11-phraseone-3.txt 1-phraseone-2.txt
2-phraseone-1.txt 3-phraseone-1.txt 4-phraseone-1.txt
5-phraseone-3.txt 6-phraseone-2.txt 7-phraseone-2.txt
8-phraseone-10.txt 9-phraseone-2.txt 10-phrasetwo-1.txt
11-phrasetwo-1.txt 1-phrasetwo-1.txt 2-phrasetwo-1.txt
3-phrasetwo-1.txt 4-phrasetwo-1.txt 5-phrasetwo-1.txt
6-phrasetwo-3.txt 7-phrasetwo-10.txt 8-phrasetwo-1.txt
9-phrasetwo-1.txt 10-phrasethree-10.txt 11-phrasethree-3.txt
1-phrasethree-1.txt 2-phrasethree-11.txt 3-phrasethree-1.txt
4-phrasethree-3.txt 5-phrasethree-1.txt 6-phrasethree-3.txt
7-phrasethree-1.txt 8-phrasethree-1.txt 9-phrasethree-1.txt
Let's introduce variables:
$speakerId
$phrase
$id1, $id2
I would like to grep a list and obtain an array:
with elements which contain specific $phrase but we exclude those strigns which simultaneously start with specific $speakerId AND end with one of specified id's (for instance $id1 or $id2)
with elements which have specific $speakerId and $phrase but do NOT contain one of specific ids at the end (warning: remember to not exclude the 10 or 11 for $id=1 , etc.)
Maybe someone coulde use the following code to write the solution:
#AllEntries = readdir(INPUTDIR);
#Result1 = grep(/blablablahere/, #AllEntries);
#Result2 = grep(/anotherblablabla/, #AllEntries);
closedir(INPUTDIR);
Assuming a basic pattern to match your example:
(?:^|\b)(\d+)-(\w+)-(?!1|2)(\d+)\.txt(?:\b|$)
Which breaks down as:
(?:^|\b) # starts with a new line or a word delimeter
(\d+)- # speakerid and a hyphen
(\w+)- # phrase and a hyphen
(\d+) # id
\.txt # file extension
(?:\b|$) # end of line or word delimeter
You can assert exclusions using negative look-ahead. For instance, to include all matches that do not have the phrase phrasetwo you can modify the above expression to use a negative look-ahead:
(?:^|\b)(\d+)-(?!phrasetwo)(\w+)-(\d+)\.txt(?:\b|$)
Note how I include (?!phrasetwo). Alternatively, you find all phrasethree entries that end in an even number by using a look-behind instead of a look-ahead:
(?:^|\b)(\d+)-phrasethree-(\d+)(?<![13579])\.txt(?:\b|$)
(?<![13579]) just makes sure the last number of the ID falls on an even number.
It sounds a bit like you're describing a query function.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
my ( $set_a, $set_b ) = query( 2, 'phrasethree', [ 1, 3 ] );
print Dumper( { a => $set_a, b => $set_b } );
# a) fetch elements which
# 1. match $phrase
# 2. exclude $speakerId
# 3. match #ids
# b) fetch elements which
# 1. match $phrase
# 2. match $speakerId
# 3. exclude #ids
sub query {
my ( $speakerId, $passPhrase, $id_ra ) = #_;
my %has_id = map { ( $_ => 0 ) } #{$id_ra};
my ( #a, #b );
while ( my $filename = glob '*.txt' ) {
if ( $filename =~ m{\A ( \d+ )-( .+? )-( \d+ ) [.] txt \z}xms ) {
my ( $_speakerId, $_passPhrase, $_id ) = ( $1, $2, $3 );
if ( $_passPhrase eq $passPhrase ) {
if ( $_speakerId ne $speakerId
&& exists $has_id{$_id} )
{
push #a, $filename;
}
if ( $_speakerId eq $speakerId
&& !exists $has_id{$_id} )
{
push #b, $filename;
}
}
}
}
return ( \#a, \#b );
}
I like the approach with pure regular expressions using negative lookaheads and -behinds. However, it's a little bit hard to read. Maybe code like this could be more self-explanatory. It uses standard perl idioms that are readable like english in some cases:
my #all_entries = readdir(...);
my #matching_entries = ();
foreach my $entry (#all_entries) {
# split file name
next unless /^(\d+)-(.*?)-(\d+).txt$/;
my ($sid, $phrase, $id) = ($1, $2, $3);
# filter
next unless $sid eq "foo";
next unless $id == 42 or $phrase eq "bar";
# more readable filter rules
# match
push #matching_entries, $entry;
}
# do something with #matching_entries
If you really want to express something that complex in a grep list transformation, you could write code like this:
my #matching_entries = grep {
/^(\d)-(.*?)-(\d+).txt$/
and $1 eq "foo"
and ($3 == 42 or $phrase eq "bar")
# and so on
} readdir(...)

Matching balanced parenthesis in Perl regex

I have an expression which I need to split and store in an array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
It should look like this once split and stored in the array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"
I use Perl version 5.8 and could someone resolve this?
Use the perl module "Regexp::Common". It has a nice balanced parenthesis Regex that works well.
# ASN.1
use Regexp::Common;
$bp = $RE{balanced}{-parens=>'{}'};
#genes = $l =~ /($bp)/g;
There's an example in perlre, using the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)
$re = qr{
( # paren group 1 (full function)
foo
( # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> [^()]+ ) # Non-parens without backtracking
|
(?2) # Recurse to start of paren group 2
)*
)
\)
)
)
}x;
I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:
my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
'aaa="bbb{}" { aa="b}b" }, ' .
'aaa="bbb,ccc"'
;
my #out = ('');
my $nesting = 0;
while($in !~ m/\G$/cg)
{
if($nesting == 0 && $in =~ m/\G,\s*/cg)
{
push #out, '';
next;
}
if($in =~ m/\G(\{+)/cg)
{ $nesting += length $1; }
elsif($in =~ m/\G(\}+)/cg)
{
$nesting -= length $1;
die if $nesting < 0;
}
elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
{ }
else
{ die; }
$out[-1] .= $1;
}
(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but so far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the dies with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \"? Can ' be used instead of "? This code doesn't handle either of those possibilities.)
To match balanced parenthesis or curly brackets, and if you want to take under account backslashed (escaped) ones, the proposed solutions would not work. Instead, you would write something like this (building on the suggested solution in perlre):
$re = qr/
( # paren group 1 (full function)
foo
(?<paren_group> # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> (?:\\[()]|(?![()]).)+ ) # escaped parens or no parens
|
(?&paren_group) # Recurse to named capture group
)*
)
\)
)
)
/x;
Try something like this:
use strict;
use warnings;
use Data::Dumper;
my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } } , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END
chomp $exp;
my #arr = map { $_ =~ s/^\s*//; $_ =~ s/\s* $//; "$_}"} split('}\s*,',$exp);
print Dumper(\#arr);
Although Recursive Regular Expressions can usually be used to capture "balanced braces" {}, they won't work for you, because you ALSO have the requirement to match "balanced quotes" ".
This would be a very tricky task for a Perl Regular Expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" Regex feature).
I would suggest creating your own parser. As you process each character, you count each " and {}, and only split on , if they are "balanced".

Regular expression problem

what's the regex for get all match about:
IF(.....);
I need to get the start and the end of the previous string: the content can be also ( and ) and then can be other (... IF (...) ....)
I need ONLY content inside IF.
Any idea ?
That's because, I need to get an Excel formula (if condition) and transforms it to another language (java script).
EDIT:
i tried
`/IF\s*(\(\s*.+?\s*\))/i or /IF(\(.+?\))/`
this doesn't work because it match only if there aren't ) or ( inside 'IF(...)'
I suspect you have a problewm that is not suitable for regex matching. You want to do unbounded counting (so you can match opening and closing parentheses) and this is more than a regexp can handle. Hand-rolling a parser to do the matching you want shouldn't be hard, though.
Essentially (pseudo-code):
Find "IF"
Ensure next character is "("
Initialise counter parendepth to 1
While parendepth > 0:
place next character in ch
if ch == "(":
parendepth += 1
if ch == ")":
parendepth -= 1
Add in small amounts of "remember start" and "remember end" and you should be all set.
This is one way to do it in Perl. Any regex flavor that allows recursion
should have this capability.
In this example, the fact that the correct parenthesis are annotated
(see the output) and balanced, means its possible to store the data
in a structured way.
This in no way validates anything, its just a quick solution.
use strict;
use warnings;
##
$/ = undef;
my $str = <DATA>;
my ($lvl, $keyword) = ( 0, '(?:IF|ELSIF)' ); # One or more keywords
# (using 2 in this example)
my $kwrx = qr/
(\b $keyword \s*) #1 - keword capture group
( #2 - recursion group
\( # literal '('
( #3 - content capture group
(?:
(?> [^()]+ ) # any non parenth char
| (?2) # or, recurse group 2
)*
)
\) # literal ')'
)
| ( (?:(?!\b $keyword \s*).)+ ) #4
| ($keyword) #5
/sx;
##
print "\n$str\n- - -\n";
findKeywords ( $str );
exit 0;
##
sub findKeywords
{
my ($str) = #_;
while ($str =~ /$kwrx/g)
{
# Process keyword(s), recurse its contents
if (defined $2) {
print "${1}[";
$lvl++;
findKeywords ( $3 );
}
# Process non-keyword text
elsif (defined $4) {
print "$4";
}
elsif (defined $5) {
print "$5";
}
}
if ($lvl > 0) {
print ']';
$lvl--;
}
}
__DATA__
IF( some junk IF (inner meter(s)) )
THEN {
IF ( its in
here
( IF (a=5)
ELSIF
( b=5
and IF( a=4 or
IF(its Monday) and there are
IF( ('lots') IF( ('of') IF( ('these') ) ) )
)
)
)
then its ok
)
ELSIF ( or here() )
ELSE (or nothing)
}
Output:
IF( some junk IF (inner meter(s)) )
THEN {
IF ( its in
here
( IF (a=5)
ELSIF
( b=5
and IF( a=4 or
IF(its Monday) and there are
IF( ('lots') IF( ('of') IF( ('these') ) ) )
)
)
)
then its ok
)
ELSIF ( or here() )
ELSE (or nothing)
}
- - -
IF[ some junk IF [inner meter(s)] ]
THEN {
IF [ its in
here
( IF [a=5]
ELSIF
[ b=5
and IF[ a=4 or
IF[its Monday] and there are
IF[ ('lots') IF[ ('of') IF[ ('these') ] ] ]
]
]
)
then its ok
]
ELSIF [ or here() ]
ELSE (or nothing)
}
Expanding on Paolo's answer, you might also need to worry about spaces and case:
/IF\s*(\(\s*.+?\s*\))/i
This should work and capture all the text between parentheses, including both parentheses, as the first match:
/IF(\(.+?\))/
Please note that it won't match IF() (empty parentheses): if you want to match empty parentheses too, you can replace the + (match one or more) with an * (match zero or more):
/IF(\(.*?\))/
--- EDIT
If you need to match formulas with parentheses (besides the outmost ones) you can use
/IF(\(.*\))/
which will make the regex "not greedy" by removing the ?. This way it will match the longest string possible. Sorry, I assumed wrongly that you did not have any sub-parentheses.
It's not possible only using regular expressions. If you are or can use .NET you should look in to using Balanced Matching.

How can I match certain nested parentheses in Perl?

^\s*[)]*\s*$ and ^\s*[(]*\s*$ matches the parentheses ( and ) which are bold. That is, what am trying is to ignore parentheses that are single and not (condition1) parentheses:
while
( #matches here
( #matches here
(condition1) && (condition2) &&
condition3
) ||
(#matches here
(condition4) ||
condition5 &&
(condition6)
) #matches here
) #matches here
but if I have like this it does not match:
while
(( #does not match here
(condition1) && (condition2) &&
condition3
) ||
(
(condition4) ||
condition5 &&
(condition6)
) ) #does not match here
or
while
((( #does not match here
(condition1) && (condition2) &&
condition3
)) ||
(( #does not match here
(condition4) ||
condition5 &&
(condition6)
) ) ) #does not match here
How can I match all the parentheses that are single?
I'd personally recommend that you use a simple stack to figure out open and closing brackets rather than trip over regular expressions.