Regular expression problem - regex

What regex will find every occurrence of:
IF(.....);
I need the start and the end of each such string. The content can itself contain ( and ), and it may contain further nested IFs, e.g. (... IF (...) ....)
I need ONLY the content inside the IF.
Any ideas?
The reason is that I need to take an Excel formula (an IF condition) and transform it into another language (JavaScript).
EDIT:
I tried
`/IF\s*(\(\s*.+?\s*\))/i or /IF(\(.+?\))/`
This doesn't work because it only matches when there is no ( or ) inside 'IF(...)'.

I suspect you have a problem that is not suitable for regex matching. You want to do unbounded counting (so you can match opening and closing parentheses), and this is more than a classic regular expression can handle. Hand-rolling a parser to do the matching you want shouldn't be hard, though.
Essentially (pseudo-code):
    Find "IF"
    Ensure next character is "("
    Initialise counter parendepth to 1
    While parendepth > 0:
        place next character in ch
        if ch == "(":
            parendepth += 1
        if ch == ")":
            parendepth -= 1
Add in small amounts of "remember start" and "remember end" and you should be all set.
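The pseudo-code above could be sketched in Perl roughly as follows. This is only an illustration: the sub name extract_if_bodies and the sample formula are made up, and it assumes parentheses inside string literals need no special treatment.

```perl
use strict;
use warnings;

# Return the text inside each top-level IF(...), scanning character by character.
# Assumption: parentheses inside quoted strings are not treated specially.
sub extract_if_bodies {
    my ($formula) = @_;
    my @bodies;
    while ( $formula =~ /\bIF\s*\(/gi ) {
        my $start = pos($formula);    # "remember start": first char after "("
        my $depth = 1;
        my $i     = $start;
        while ( $depth > 0 && $i < length($formula) ) {
            my $ch = substr( $formula, $i++, 1 );
            $depth++ if $ch eq '(';
            $depth-- if $ch eq ')';
        }
        last if $depth > 0;           # unbalanced: no closing ")" found
        # "remember end": $i now points just past the closing ")"
        push @bodies, substr( $formula, $start, $i - $start - 1 );
        pos($formula) = $i;           # continue searching after this IF(...)
    }
    return @bodies;
}

my @found = extract_if_bodies('=IF(A1>(B1+2),IF(C1,1,2),SUM(D1:D3))+IF(E1,3,4)');
print "$_\n" for @found;
```

Nested IFs are not reported separately here; calling the sub again on each returned body would pick them up.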

This is one way to do it in Perl. Any regex flavor that allows recursion
should have this capability.
In this example, the fact that the correct parentheses are annotated
(see the output) and balanced means it's possible to store the data
in a structured way.
This in no way validates anything; it's just a quick solution.
use strict;
use warnings;
##
$/ = undef;
my $str = <DATA>;
my ($lvl, $keyword) = ( 0, '(?:IF|ELSIF)' );  # One or more keywords
                                              # (using 2 in this example)
my $kwrx = qr/
    (\b $keyword \s*)                  #1 - keyword capture group
    (                                  #2 - recursion group
        \(                             # literal '('
        (                              #3 - content capture group
            (?:
                (?> [^()]+ )           # any non-paren char
              | (?2)                   # or, recurse group 2
            )*
        )
        \)                             # literal ')'
    )
  | ( (?:(?!\b $keyword \s*).)+ )      #4
  | ($keyword)                         #5
/sx;
##
print "\n$str\n- - -\n";
findKeywords ( $str );
exit 0;
##
sub findKeywords
{
    my ($str) = @_;
    while ($str =~ /$kwrx/g)
    {
        # Process keyword(s), recurse its contents
        if (defined $2) {
            print "${1}[";
            $lvl++;
            findKeywords ( $3 );
        }
        # Process non-keyword text
        elsif (defined $4) {
            print "$4";
        }
        elsif (defined $5) {
            print "$5";
        }
    }
    if ($lvl > 0) {
        print ']';
        $lvl--;
    }
}
__DATA__
IF( some junk IF (inner meter(s)) )
THEN {
IF ( its in
here
( IF (a=5)
ELSIF
( b=5
and IF( a=4 or
IF(its Monday) and there are
IF( ('lots') IF( ('of') IF( ('these') ) ) )
)
)
)
then its ok
)
ELSIF ( or here() )
ELSE (or nothing)
}
Output:
IF( some junk IF (inner meter(s)) )
THEN {
IF ( its in
here
( IF (a=5)
ELSIF
( b=5
and IF( a=4 or
IF(its Monday) and there are
IF( ('lots') IF( ('of') IF( ('these') ) ) )
)
)
)
then its ok
)
ELSIF ( or here() )
ELSE (or nothing)
}
- - -
IF[ some junk IF [inner meter(s)] ]
THEN {
IF [ its in
here
( IF [a=5]
ELSIF
[ b=5
and IF[ a=4 or
IF[its Monday] and there are
IF[ ('lots') IF[ ('of') IF[ ('these') ] ] ]
]
]
)
then its ok
]
ELSIF [ or here() ]
ELSE (or nothing)
}

Expanding on Paolo's answer, you might also need to worry about spaces and case:
/IF\s*(\(\s*.+?\s*\))/i

This should work and capture all the text between parentheses, including both parentheses, as the first match:
/IF(\(.+?\))/
Please note that it won't match IF() (empty parentheses): if you want to match empty parentheses too, you can replace the + (match one or more) with an * (match zero or more):
/IF(\(.*?\))/
--- EDIT
If you need to match formulas with parentheses (besides the outermost ones) you can use
/IF(\(.*\))/
which makes the match greedy by removing the ?. This way it will match the longest string possible, i.e. up to the last ) in the input, so it over-matches if more than one IF(...) is present. Sorry, I assumed wrongly that you did not have any sub-parentheses.
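A quick way to see the difference between the lazy and the greedy form is to run both against a made-up formula with two IFs (the formula is only an illustration):

```perl
use strict;
use warnings;

my $formula = 'IF(A1>(B1+2),1,0)+IF(C1,3,4)';

my ($lazy)   = $formula =~ /IF(\(.+?\))/;   # stops at the first ")"
my ($greedy) = $formula =~ /IF(\(.+\))/;    # runs on to the last ")"

print "lazy:   $lazy\n";     # (A1>(B1+2)
print "greedy: $greedy\n";   # (A1>(B1+2),1,0)+IF(C1,3,4)
```

Neither result is the balanced (A1>(B1+2),1,0), which is why nested input needs counting or a recursive pattern.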

It's not possible using regular expressions alone. If you are using, or can use, .NET, you should look into Balanced Matching.

Related

perl regex to get comma not in parenthesis or nested parenthesis

I have a comma-separated string and I want to match every comma that is not inside parentheses (the parentheses are guaranteed to be balanced).
a , (b) , (d$_,c) , ((,),d,(,))
The commas between a and (b), (b) and (d$_,c), and (d$_,c) and ((,),d,(,)) should match, but not the ones inside (d$_,c) or ((,),d,(,)).
Note: Eventually I want to split the string by these commas.
I tried this regex:
(?!<(?:\(|\[)[^)\]]+),(?![^(\[]+(?:\)|\])) from here but it only works for non-nested parentheses.
You may use
(\((?:[^()]++|(?1))*\))(*SKIP)(*F)|,
Details
(\((?:[^()]++|(?1))*\)) - Capturing group 1: matches a substring between balanced parentheses:
\( - a ( char
(?:[^()]++|(?1))* - zero or more occurrences of 1+ chars other than ( and ) or the whole Group 1 pattern (due to the regex subroutine (?1) that is necessary here since only a part of the whole regex pattern is recursed)
\) - a ) char.
(*SKIP)(*F) - omits the found match and starts the next search from the end of the match
| - or
, - matches a comma outside nested parentheses.
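Since the goal is to split on those commas, here is a minimal sketch of how the pattern might be used for that in Perl (5.10 or later). It collects the fields from the match offsets in @- and @+ rather than calling split, because split would also return the capture group; the variable names are just for illustration.

```perl
use strict;
use warnings;

my $str = 'a , (b) , (d$_,c) , ((,),d,(,))';

my @fields;
my $last = 0;    # offset where the current field starts
while ( $str =~ /(\((?:[^()]++|(?1))*\))(*SKIP)(*F)|,/g ) {
    push @fields, substr( $str, $last, $-[0] - $last );  # text before the comma
    $last = $+[0];                                       # skip past the comma
}
push @fields, substr( $str, $last );                     # text after the last comma

s/^\s+|\s+$//g for @fields;                              # trim surrounding spaces
print "[$_]\n" for @fields;
```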
A single regex for this is massively overcomplicated and difficult to maintain or extend. Here is an iterative parser approach:
use strict;
use warnings;
my $str = 'a , (b) , (d$_,c) , ((,),d,(,))';
my $nesting = 0;
my $buffer = '';
my @vals;
while ($str =~ m/\G([,()]|[^,()]+)/g) {
    my $token = $1;
    if ($token eq ',' and !$nesting) {
        push @vals, $buffer;
        $buffer = '';
    } else {
        $buffer .= $token;
        if ($token eq '(') {
            $nesting++;
        } elsif ($token eq ')') {
            $nesting--;
        }
    }
}
push @vals, $buffer if length $buffer;
print "$_\n" for @vals;
You can use Parser::MGC to construct this sort of parser more abstractly.

How to remove strings which do not start or end with specific substring?

Unfortunately, I'm not a regex expert, so I need a little help.
I'm looking for a way to grep an array of strings to get two lists: strings which do not start (1) or do not end (2) with a specific substring.
Let's assume we have an array of strings matching the following pattern:
[speakerId]-[phrase]-[id].txt
e.g.
10-phraseone-10.txt 11-phraseone-3.txt 1-phraseone-2.txt
2-phraseone-1.txt 3-phraseone-1.txt 4-phraseone-1.txt
5-phraseone-3.txt 6-phraseone-2.txt 7-phraseone-2.txt
8-phraseone-10.txt 9-phraseone-2.txt 10-phrasetwo-1.txt
11-phrasetwo-1.txt 1-phrasetwo-1.txt 2-phrasetwo-1.txt
3-phrasetwo-1.txt 4-phrasetwo-1.txt 5-phrasetwo-1.txt
6-phrasetwo-3.txt 7-phrasetwo-10.txt 8-phrasetwo-1.txt
9-phrasetwo-1.txt 10-phrasethree-10.txt 11-phrasethree-3.txt
1-phrasethree-1.txt 2-phrasethree-11.txt 3-phrasethree-1.txt
4-phrasethree-3.txt 5-phrasethree-1.txt 6-phrasethree-3.txt
7-phrasethree-1.txt 8-phrasethree-1.txt 9-phrasethree-1.txt
Let's introduce variables:
$speakerId
$phrase
$id1, $id2
I would like to grep the list and obtain two arrays:
(1) elements which contain a specific $phrase, excluding those strings which simultaneously start with a specific $speakerId AND end with one of the specified ids (for instance $id1 or $id2);
(2) elements which have a specific $speakerId and $phrase but do NOT end with one of the specified ids (warning: remember not to exclude 10 or 11 for $id=1, etc.)
Maybe someone could use the following code to write the solution:
@AllEntries = readdir(INPUTDIR);
@Result1 = grep(/blablablahere/, @AllEntries);
@Result2 = grep(/anotherblablabla/, @AllEntries);
closedir(INPUTDIR);
Assuming a basic pattern to match your example:
(?:^|\b)(\d+)-(\w+)-(\d+)\.txt(?:\b|$)
Which breaks down as:
(?:^|\b)   # start of line or a word boundary
(\d+)-     # speakerid and a hyphen
(\w+)-     # phrase and a hyphen
(\d+)      # id
\.txt      # file extension
(?:\b|$)   # end of line or a word boundary
You can assert exclusions using negative look-ahead. For instance, to include all matches that do not have the phrase phrasetwo you can modify the above expression to use a negative look-ahead:
(?:^|\b)(\d+)-(?!phrasetwo)(\w+)-(\d+)\.txt(?:\b|$)
Note how I include (?!phrasetwo). Alternatively, you can find all phrasethree entries that end in an even number by using a look-behind instead of a look-ahead:
(?:^|\b)(\d+)-phrasethree-(\d+)(?<![13579])\.txt(?:\b|$)
(?<![13579]) just makes sure the last digit of the ID is even.
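Applied to the two lists asked for in the question, the look-ahead idea might look like the sketch below. The sample file names, the variable names and the chosen values (speaker 2, phrasethree, ids 1 and 3) are assumptions for illustration only. Anchoring the id alternation to the following ".txt" is what keeps id 1 from also excluding 10 or 11.

```perl
use strict;
use warnings;

my @all_entries = qw(
    2-phrasethree-11.txt 2-phrasethree-1.txt 3-phrasethree-1.txt
    2-phrasetwo-1.txt    10-phrasethree-10.txt
);
my ( $speaker_id, $phrase, @ids ) = ( 2, 'phrasethree', 1, 3 );

# Alternation of the ids to exclude, e.g. "1|3"
my $id_alt = join '|', map { quotemeta($_) } @ids;

# (1) has $phrase, but is not "starts with $speaker_id AND ends with a listed id"
my @result1 = grep {
    /^\d+-\Q$phrase\E-\d+\.txt$/
        && !/^\Q$speaker_id\E-\Q$phrase\E-(?:$id_alt)\.txt$/
} @all_entries;

# (2) has $speaker_id and $phrase, but its id is not one of the listed ones
my @result2 = grep {
    /^\Q$speaker_id\E-\Q$phrase\E-(?!(?:$id_alt)\.txt$)\d+\.txt$/
} @all_entries;

print "1: @result1\n";
print "2: @result2\n";
```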
It sounds a bit like you're describing a query function.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
my ( $set_a, $set_b ) = query( 2, 'phrasethree', [ 1, 3 ] );
print Dumper( { a => $set_a, b => $set_b } );
# a) fetch elements which
#    1. match $phrase
#    2. exclude $speakerId
#    3. match @ids
# b) fetch elements which
#    1. match $phrase
#    2. match $speakerId
#    3. exclude @ids
sub query {
    my ( $speakerId, $passPhrase, $id_ra ) = @_;
    my %has_id = map { ( $_ => 0 ) } @{$id_ra};
    my ( @a, @b );
    while ( my $filename = glob '*.txt' ) {
        if ( $filename =~ m{\A ( \d+ )-( .+? )-( \d+ ) [.] txt \z}xms ) {
            my ( $_speakerId, $_passPhrase, $_id ) = ( $1, $2, $3 );
            if ( $_passPhrase eq $passPhrase ) {
                if ( $_speakerId ne $speakerId
                    && exists $has_id{$_id} )
                {
                    push @a, $filename;
                }
                if ( $_speakerId eq $speakerId
                    && !exists $has_id{$_id} )
                {
                    push @b, $filename;
                }
            }
        }
    }
    return ( \@a, \@b );
}
I like the approach with pure regular expressions using negative look-aheads and look-behinds. However, it's a little bit hard to read. Maybe code like this could be more self-explanatory. It uses standard Perl idioms that in some cases read almost like English:
my @all_entries = readdir(...);
my @matching_entries = ();
foreach my $entry (@all_entries) {
    # split file name
    next unless $entry =~ /^(\d+)-(.*?)-(\d+)\.txt$/;
    my ($sid, $phrase, $id) = ($1, $2, $3);
    # filter
    next unless $sid eq "foo";
    next unless $id == 42 or $phrase eq "bar";
    # more readable filter rules
    # match
    push @matching_entries, $entry;
}
# do something with @matching_entries
If you really want to express something that complex in a grep list transformation, you could write code like this:
my @matching_entries = grep {
    /^(\d+)-(.*?)-(\d+)\.txt$/
    and $1 eq "foo"
    and ($3 == 42 or $2 eq "bar")
    # and so on
} readdir(...);

Matching balanced parenthesis in Perl regex

I have an expression which I need to split and store in an array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
It should look like this once split and stored in the array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"
I use Perl version 5.8. Could someone help me solve this?
Use the Perl module "Regexp::Common". It has a nice balanced-parentheses regex that works well.
# ASN.1
use Regexp::Common;
$bp = $RE{balanced}{-parens=>'{}'};
@genes = $l =~ /($bp)/g;
There's an example in perlre, using the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)
$re = qr{
    (                   # paren group 1 (full function)
        foo
        (               # paren group 2 (parens)
            \(
            (           # paren group 3 (contents of parens)
                (?:
                    (?> [^()]+ )    # Non-parens without backtracking
                  |
                    (?2)            # Recurse to start of paren group 2
                )*
            )
            \)
        )
    )
}x;
I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:
my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
         'aaa="bbb{}" { aa="b}b" }, ' .
         'aaa="bbb,ccc"'
;
my @out = ('');
my $nesting = 0;
while($in !~ m/\G$/cg)
{
    if($nesting == 0 && $in =~ m/\G,\s*/cg)
    {
        push @out, '';
        next;
    }
    if($in =~ m/\G(\{+)/cg)
        { $nesting += length $1; }
    elsif($in =~ m/\G(\}+)/cg)
    {
        $nesting -= length $1;
        die if $nesting < 0;
    }
    elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
        { }
    else
        { die; }
    $out[-1] .= $1;
}
(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but so far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the dies with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \"? Can ' be used instead of "? This code doesn't handle either of those possibilities.)
To match balanced parentheses or curly brackets while also taking backslash-escaped ones into account, the solutions proposed so far will not work. Instead, you would write something like this (building on the solution suggested in perlre):
$re = qr/
    (                       # paren group 1 (full function)
        foo
        (?<paren_group>     # paren group 2 (parens)
            \(
            (               # paren group 3 (contents of parens)
                (?:
                    (?> (?:\\[()]|(?![()]).)+ )    # escaped parens or no parens
                  |
                    (?&paren_group)                # Recurse to named capture group
                )*
            )
            \)
        )
    )
/x;
Try something like this:
use strict;
use warnings;
use Data::Dumper;
my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } } , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END
chomp $exp;
# Split on a comma that directly follows a closing brace, keeping the brace
# (a heuristic that relies on the shape of the sample input)
my @arr = map { s/^\s+//; s/\s+$//; $_ } split /(?<=\})\s*,/, $exp;
print Dumper(\@arr);
Although Recursive Regular Expressions can usually be used to capture "balanced braces" {}, they won't work for you, because you ALSO have the requirement to match "balanced quotes" ".
This would be a very tricky task for a Perl Regular Expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" Regex feature).
I would suggest creating your own parser. As you process each character, you count each " and {}, and only split on , if they are "balanced".
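A character-by-character version of that idea could look roughly like this. It is only a sketch under stated assumptions: quotes cannot be escaped, only double quotes count, and the sub name split_top_level is made up for the example.

```perl
use strict;
use warnings;

# Split on commas that are outside both double quotes and braces.
# Assumptions: no escaped quotes; braces inside quotes are ignored.
sub split_top_level {
    my ($in) = @_;
    my @out      = ('');
    my $depth    = 0;    # current {} nesting level
    my $in_quote = 0;    # true while inside "..."
    for my $ch ( split //, $in ) {
        if ( $ch eq '"' ) {
            $in_quote = !$in_quote;
        }
        elsif ( !$in_quote ) {
            $depth++ if $ch eq '{';
            $depth-- if $ch eq '}';
            if ( $ch eq ',' && $depth == 0 ) {
                push @out, '';    # balanced here: start a new element
                next;
            }
        }
        $out[-1] .= $ch;
    }
    s/^\s+|\s+$//g for @out;      # trim each element
    return @out;
}

my @parts = split_top_level(
    'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"'
);
print "$_\n" for @parts;
```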

Finding results from and between groups of parentheses with regexp

Text format:
(Superships)
Eirik Raude - olajkutató fúrósziget
(Eirik Raude - Oil Patch Explorer)
I need a regex to match the text between the first set of parentheses. Result: text1.
I need a regex to match the text between the first and the second set of parentheses. Result: text2.
I need a regex to match the text between the second set of parentheses. Result: text3.
text1: Superships, the English title,
text2: Eirik Raude - olajkutató fúrósziget, the Hungarian subtitle,
text3: Eirik Raude - Oil Patch Explorer, the English subtitle.
I need a regex for a Perl script to match this title and the subtitles. Example script:
($anchor) = $tree->look_down(_tag=>"h1", class=>"blackbigtitle");
if ($anchor) {
    $elem = $anchor;
    my ($engtitle, $engsubtitle, $hunsubtitle, @tmp);
    while (($elem = $elem->right()) &&
           ((ref $elem) && ($elem->tag() ne "table"))) {
        @tmp = get_all_text($elem);
        push @lines, @tmp;
        $line = join(' ', @tmp);
        if (($engtitle) = $line =~ m/**regex need that return text1**/) {
            push @{$prog->{q(title)}}, [$engtitle, 'en'];
            t "english-title added: $engtitle";
        }
        elsif (($engsubtitle) = $line =~ m/**regex need that return text3**/) {
            push @{$prog->{q(sub-title)}}, [$engsubtitle, 'en'];
            t "english_subtitle added: $engsubtitle";
        }
        elsif (($hunsubtitle) = $line =~ m/**regex need that return text2**/) {
            push @{$prog->{q(hun-subtitle)}}, [$hunsubtitle, 'hu'];
            t "hungarian_subtitle added: $hunsubtitle";
        }
    }
}
Considering your comment, you can do something like:
if (!$found_english_title && (($english_title) = $line =~ m/^\(([^)]+)\)$/)) {
    $found_english_title = 1;
    # do stuff
} elsif (($hungarian_subtitle) = $line =~ m/^([^()]+)$/) {
    # do stuff
} elsif ($found_english_title && (($english_subtitle) = $line =~ m/^\(([^)]+)\)$/)) {
    # do stuff
}
If you need to match them all in one expression:
\(([^)]+)\)([^(]+)\(([^)]+)\)
This matches (, then anything that's not ), then ), then anything that's not (, then, (, ... I think you get the picture.
First group will be text1, second group will be text2, third group will be text3.
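As a rough check, the single expression can be tried on the sample text in Perl like this. The variable names are made up, and the whitespace trimming of the middle group is an addition, since that group also picks up the line breaks around it.

```perl
use strict;
use warnings;
use utf8;                                   # the sample text contains accented characters
binmode STDOUT, ':encoding(UTF-8)';

my $text = "(Superships)\nEirik Raude - olajkutató fúrósziget\n(Eirik Raude - Oil Patch Explorer)";

my ( $eng_title, $hun_sub, $eng_sub ) =
    $text =~ /\(([^)]+)\)([^(]+)\(([^)]+)\)/
    or die "pattern did not match\n";

$hun_sub =~ s/^\s+|\s+$//g;                 # group 2 includes the surrounding newlines

print "text1: $eng_title\n";
print "text2: $hun_sub\n";
print "text3: $eng_sub\n";
```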
You can also just make a more generic regex that matches something like "(text1)", "(text1)text2(text3)" or "text1(text2)" when applied several times:
(?:^|[()])([^()]+)(?:[()]|$)
This matches the beginning of the string or ( or ), then one or more characters that are not ( or ), then ( or ) or the end of the string. ?: marks a non-capturing group, so the first capturing group will hold the string. Something more complex is necessary to pair each ( with its ), since this pattern can also match "(text1(".

Parsing YAML-like text file into hash structure

I've got the text file:
country = {
tag = ENG
ai = {
flags = { }
combat = { ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD }
continent = { "Oceania" }
area = { "America" "Maine" "Georgia" "Newfoundland" "Cuba" "Bengal" "Carnatic" "Ceylon" "Tanganyika" "The Mascarenes" "The Cape" "Gold" "St Helena" "Guiana" "Falklands" "Bermuda" "Oregon" }
region = { "North America" "Carribean" "India" }
war = 50
ferocity = no
}
date = { year = 0 month = january day = 0 }
}
What I'm trying to do is to parse this text into perl hash structure, so that the output after data dump looks like this:
$VAR1 = {
'country' => {
'ai' => {
'area' => [
'America',
'Maine',
'Georgia',
'Newfoundland',
'Cuba',
'Bengal',
'Carnatic',
'Ceylon',
'Tanganyika',
'The Mascarenes',
'The Cape',
'Gold',
'St Helena',
'Guiana',
'Falklands',
'Bermuda',
'Oregon'
],
'combat' => [
'ROY',
'WLS',
'PUR',
'SCO',
'EIR',
'FRA',
'DEL',
'USA',
'QUE',
'BGL',
'MAH',
'MOG',
'VIJ',
'MYS',
'DLH',
'GUJ',
'ORI',
'JAI',
'ASS',
'MLC',
'MYA',
'ARK',
'PEG',
'TAU',
'HYD'
],
'continent' => [
'Oceania'
],
'ferocity' => 'no',
'flags' => [],
'region' => [
'North America',
'Carribean',
'India'
],
'war' => 50
},
'date' => {
'day' => 0,
'month' => 'january',
'year' => 0
},
'tag' => 'ENG'
}
};
Hardcoded version might look like this:
#!/usr/bin/perl
use Data::Dumper;
use warnings;
use strict;
my $ret;
$ret->{'country'}->{tag} = 'ENG';
$ret->{'country'}->{ai}->{flags} = [];
my @qw = qw( ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD );
$ret->{'country'}->{ai}->{combat} = \@qw;
$ret->{'country'}->{ai}->{continent} = ["Oceania"];
$ret->{'country'}->{ai}->{area} = ["America", "Maine", "Georgia", "Newfoundland", "Cuba", "Bengal", "Carnatic", "Ceylon", "Tanganyika", "The Mascarenes", "The Cape", "Gold", "St Helena", "Guiana", "Falklands", "Bermuda", "Oregon"];
$ret->{'country'}->{ai}->{region} = ["North America", "Carribean", "India"];
$ret->{'country'}->{ai}->{war} = 50;
$ret->{'country'}->{ai}->{ferocity} = 'no';
$ret->{'country'}->{date}->{year} = 0;
$ret->{'country'}->{date}->{month} = 'january';
$ret->{'country'}->{date}->{day} = 0;
sub hash_sort {
    my ($hash) = @_;
    return [ (sort keys %$hash) ];
}
$Data::Dumper::Sortkeys = \&hash_sort;
print Dumper($ret);
I have to admit I have a huge problem dealing with nested curly brackets.
I've tried to solve it using greedy and non-greedy matching, but that didn't do the trick. I've also read about extended patterns (like (?PARNO)), but I have absolutely no clue how to use them for my particular problem. The order of the data is irrelevant, since I have the hash_sort subroutine.
I'd appreciate any help.
I broke it down to some simple assumptions:
An entry consists of an identifier followed by an equals sign
An entry is one of three basic types: a level, a set, or a single value
A set has 3 forms: 1) quoted, space-separated list; 2) key-value pairs; 3) qw-like unquoted list
A set of key-value pairs must contain an identifier for a key and either nonspaces or a quoted value for a value
See the interspersed comments.
use strict;
use warnings;
my $simple_value_RE
    = qr/^ \s* (\p{Alpha}\w*) \s* = \s* ( [^\s{}]+ | "[^"]*" ) \s* $/x
    ;
my $set_or_level_RE
    = qr/^ \s* (\w+) \s* = \s* [{] (?: ([^}]+) [}] )? \s* $/x
    ;
my $quoted_set_RE
    = qr/^ \s* (?: "[^"]+" \s+ )* "[^"]+" \s* $/x
    ;
my $associative_RE
    = qr/^ \s*
        (?: \p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ ) \s+ )*
        \p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ )
        \s* $
      /x
    ;
my $pair_RE = qr/ \b ( \p{Alpha}\w* ) \s* = \s* ( "[^"]+" | \S+ )/x;
sub get_level {
    my $handle = shift;
    my %level;
    while ( <$handle> ) {
        # if the first character on the line is a close, then we're done
        # at this level
        last if m/^\s*[}]/;
        my ( $key, $value );
        # get simple values
        if (( $key, $value ) = m/$simple_value_RE/ ) {
            # done.
        }
        elsif (( $key, my $complete_set ) = m/$set_or_level_RE/ ) {
            if ( $complete_set ) {
                if ( $complete_set =~ m/$quoted_set_RE/ ) {
                    # Pull all quoted values with global flag
                    $value = [ $complete_set =~ m/"([^"]+)"/g ];
                }
                elsif ( $complete_set =~ m/$associative_RE/ ) {
                    # going to create a hashref. First, with a global flag
                    # repeatedly pull all qualified pairs
                    # then split them to key and value by splitting them at
                    # the first '='
                    $value
                        = { map { split /\s*=\s*/, $_, 2 }
                            ( $complete_set =~ m/$pair_RE/g )
                          };
                }
                else {
                    # qw-like
                    $value = [ split( ' ', $complete_set ) ];
                }
            }
            else {
                $value = get_level( $handle );
            }
        }
        $level{ $key } = $value;
    }
    return wantarray ? %level : \%level;
}
my %base = get_level( \*DATA );
Well, as David suggested, the easiest way would be to get whatever produced the file to use a standard format. JSON, YAML, or XML would be much easier to parse.
But if you really have to parse this format, I'd write a grammar for it using Regexp::Grammars (if you can require Perl 5.10) or Parse::RecDescent (if you can't). This'll be a little tricky, especially because you seem to be using braces for both hashes & arrays, but it should be doable.
The contents look pretty regular. Why not perform some substitutions on the content to convert it to Perl hash syntax, then eval it? That would be a quick and dirty way to convert it.
You can also write a parser, assuming you know the grammar.
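For what it's worth, a hand-written parser for this format does not have to be long. The sketch below assumes a grammar inferred from the sample only: entries are "key = value", and a braced value is a hash when its second token is "=", otherwise a list. All sub names are made up, and the input is a shortened version of the sample.

```perl
use strict;
use warnings;

# Split the text into tokens: punctuation ({ } =), quoted strings, bare words.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ( $text =~ /\G\s*(?:([{}=])|"([^"]*)"|([^\s{}="]+))/gc ) {
        if ( defined $1 ) { push @tokens, { type => 'punct', text => $1 } }
        else { push @tokens, { type => 'word', text => defined($2) ? $2 : $3 } }
    }
    my $rest = substr( $text, pos($text) // 0 );
    die "cannot tokenize: $rest\n" if $rest =~ /\S/;
    return \@tokens;
}

sub is_punct {
    my ( $tok, $char ) = @_;
    return defined($tok) && $tok->{type} eq 'punct' && $tok->{text} eq $char;
}

# Parse "key = value" entries until a closing brace or the end of input.
sub parse_pairs {
    my ($tokens) = @_;
    my %hash;
    while ( @$tokens && !is_punct( $tokens->[0], '}' ) ) {
        my $key = shift(@$tokens)->{text};
        is_punct( shift(@$tokens), '=' ) or die "expected '=' after $key\n";
        $hash{$key} = parse_value($tokens);
    }
    return \%hash;
}

# Parse a scalar, a braced list, or a braced set of key = value entries.
sub parse_value {
    my ($tokens) = @_;
    my $tok = shift(@$tokens) or die "unexpected end of input\n";
    die "unexpected '$tok->{text}'\n"
        if $tok->{type} eq 'punct' && $tok->{text} ne '{';
    return $tok->{text} if $tok->{type} eq 'word';

    my ( $first, $second ) = ( $tokens->[0], $tokens->[1] );
    my $value;
    if ( !is_punct( $first, '}' ) && is_punct( $second, '=' ) ) {
        $value = parse_pairs($tokens);    # { key = value ... } becomes a hash
    }
    else {
        my @items;                        # { a b "c d" } becomes a list
        push @items, shift(@$tokens)->{text}
            while @$tokens && !is_punct( $tokens->[0], '}' );
        $value = \@items;
    }
    is_punct( shift(@$tokens), '}' ) or die "missing '}'\n";
    return $value;
}

sub parse_text {
    my ($text) = @_;
    my $tokens = tokenize($text);
    my $data   = parse_pairs($tokens);
    die "unbalanced '}'\n" if @$tokens;
    return $data;
}

my $data = parse_text(<<'END');
country = {
    tag = ENG
    ai = {
        flags = { }
        continent = { "Oceania" }
        region = { "North America" "Carribean" "India" }
        war = 50
    }
    date = { year = 0 month = january day = 0 }
}
END

print join( ', ', @{ $data->{country}{ai}{region} } ), "\n";
```

The result can be passed to Data::Dumper in the same way as the hardcoded structure in the question.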