Parsing YAML-like text file into hash structure - regex

I've got the text file:
country = {
tag = ENG
ai = {
flags = { }
combat = { ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD }
continent = { "Oceania" }
area = { "America" "Maine" "Georgia" "Newfoundland" "Cuba" "Bengal" "Carnatic" "Ceylon" "Tanganyika" "The Mascarenes" "The Cape" "Gold" "St Helena" "Guiana" "Falklands" "Bermuda" "Oregon" }
region = { "North America" "Carribean" "India" }
war = 50
ferocity = no
}
date = { year = 0 month = january day = 0 }
}
What I'm trying to do is to parse this text into perl hash structure, so that the output after data dump looks like this:
$VAR1 = {
'country' => {
'ai' => {
'area' => [
'America',
'Maine',
'Georgia',
'Newfoundland',
'Cuba',
'Bengal',
'Carnatic',
'Ceylon',
'Tanganyika',
'The Mascarenes',
'The Cape',
'Gold',
'St Helena',
'Guiana',
'Falklands',
'Bermuda',
'Oregon'
],
'combat' => [
'ROY',
'WLS',
'PUR',
'SCO',
'EIR',
'FRA',
'DEL',
'USA',
'QUE',
'BGL',
'MAH',
'MOG',
'VIJ',
'MYS',
'DLH',
'GUJ',
'ORI',
'JAI',
'ASS',
'MLC',
'MYA',
'ARK',
'PEG',
'TAU',
'HYD'
],
'continent' => [
'Oceania'
],
'ferocity' => 'no',
'flags' => [],
'region' => [
'North America',
'Carribean',
'India'
],
'war' => 50
},
'date' => {
'day' => 0,
'month' => 'january',
'year' => 0
},
'tag' => 'ENG'
}
};
Hardcoded version might look like this:
#!/usr/bin/perl
use Data::Dumper;
use warnings;
use strict;
my $ret;
$ret->{'country'}->{tag} = 'ENG';
$ret->{'country'}->{ai}->{flags} = [];
my #qw = qw( ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD );
$ret->{'country'}->{ai}->{combat} = \#qw;
$ret->{'country'}->{ai}->{continent} = ["Oceania"];
$ret->{'country'}->{ai}->{area} = ["America", "Maine", "Georgia", "Newfoundland", "Cuba", "Bengal", "Carnatic", "Ceylon", "Tanganyika", "The Mascarenes", "The Cape", "Gold", "St Helena", "Guiana", "Falklands", "Bermuda", "Oregon"];
$ret->{'country'}->{ai}->{region} = ["North America", "Carribean", "India"];
$ret->{'country'}->{ai}->{war} = 50;
$ret->{'country'}->{ai}->{ferocity} = 'no';
$ret->{'country'}->{date}->{year} = 0;
$ret->{'country'}->{date}->{month} = 'january';
$ret->{'country'}->{date}->{day} = 0;
sub hash_sort {
my ($hash) = #_;
return [ (sort keys %$hash) ];
}
$Data::Dumper::Sortkeys = \hash_sort;
print Dumper($ret);
I have to admit I have a huge problem dealing with nested curly brackets.
I've tried to solve it by using greedy and ungreedy matching, but it seems it didn't do the trick. I've also read about extended patterns (like (?PARNO)) but I have absolutely no clue how to use them in my particular problem. Order of data is irrelevant, since I have the hash_sort subroutine.
I'll apprieciate any help.

I broke it down to some simple assumptions:
An entry would consist of an identifier followed by an equals sign
An entry would be one of three basic types: a level or set or a single value
A set has 3 forms: 1) quoted, space-separated list; 2) key-value pairs, 3) qw-like unquoted list
A set of key-value pairs must contain an indentifier for a key and either nonspaces or a quoted
value for a value
See the interspersed comments.
use strict;
use warnings;
my $simple_value_RE
= qr/^ \s* (\p{Alpha}\w*) \s* = \s* ( [^\s{}]+ | "[^"]*" ) \s* $/x
;
my $set_or_level_RE
= qr/^ \s* (\w+) \s* = \s* [{] (?: ([^}]+) [}] )? \s* $/x
;
my $quoted_set_RE
= qr/^ \s* (?: "[^"]+" \s+ )* "[^"]+" \s* $/x
;
my $associative_RE
= qr/^ \s*
(?: \p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ ) \s+ )*
\p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ )
\s* $
/x
;
my $pair_RE = qr/ \b ( \p{Alpha}\w* ) \s* = \s* ( "[^"]+" | \S+ )/x;
sub get_level {
my $handle = shift;
my %level;
while ( <$handle> ) {
# if the first character on the line is a close, then we're done
# at this level
last if m/^\s*[}]/;
my ( $key, $value );
# get simple values
if (( $key, $value ) = m/$simple_value_RE/ ) {
# done.
}
elsif (( $key, my $complete_set ) = m/$set_or_level_RE/ ) {
if ( $complete_set ) {
if ( $complete_set =~ m/$quoted_set_RE/ ) {
# Pull all quoted values with global flag
$value = [ $complete_set =~ m/"([^"]+)"/g ];
}
elsif ( $complete_set =~ m/$associative_RE/ ) {
# going to create a hashref. First, with a global flag
# repeatedly pull all qualified pairs
# then split them to key and value by spliting them at
# the first '='
$value
= { map { split /\s*=\s*/, $_, 2 }
( $complete_set =~ m/$pair_RE/g )
};
}
else {
# qw-like
$value = [ split( ' ', $complete_set ) ];
}
}
else {
$value = get_level( $handle );
}
}
$level{ $key } = $value;
}
return wantarray ? %level : \%level;
}
my %base = get_level( \*DATA );

Well, as David suggested, the easiest way would be to get whatever produced the file to use a standard format. JSON, YAML, or XML would be much easier to parse.
But if you really have to parse this format, I'd write a grammar for it using Regexp::Grammars (if you can require Perl 5.10) or Parse::RecDescent (if you can't). This'll be a little tricky, especially because you seem to be using braces for both hashes & arrays, but it should be doable.

The contents look pretty regular. Why not perform some substitutions on the content and convert it to hash syntax, then eval it. That would be a quick and dirty way to convert it.
You can also write a parser, assuming you know the grammar.

Related

Rewriting a recursive regex for older Perl version

The following piece of code works just fine with Perl (v5.16.2). However, when I run it using Perl v5.8.9, it complains about the following regex. How can I rewrite this regex in a way that works with Perl v5.8.9. (I can't update the version).
REGEX:
use strict;
use warnings;
our %formula_per_k;
INIT {
# List all functions that you want to allow in formulas. All other words will be interpretted as variables.
my #FORMULA_FUNCS = qw(sqrt exp log);
# Load the data via a file.
my $data = do {local $/; <DATA>};
# Parse K blocks
while ($data =~ m{
^K \s+ (\w+) \s* \{
( (?: [^{}]+ | \{(?2)\} )* ) # Matched braces only.
\}
}mgx) {
my ($name, $params) = ($1, $2);
# Parse LOL block
next if $params !~ m{
LOL \s* \{
( (?: [^{}]+ | \{(?1)\} )*? ) # Matched braces only.
\}
}mx;
my $lol = $1;
# Start building anonymous subroutine
my $conditions = '';
# Parse Conditions and Formulas
while ($lol =~ m{
COND \s* \{ (.*?) \} \s*
FORMULA \s* \{ (.*?) \}
}gx) {
my ($cond, $formula) = ($1, $2);
# Remove Excess spacing and translate variable into perl scalar.
for ($cond, $formula) {
s/^\s+|\s+$//g;
s{([a-zA-Z]+)}{
my $var = $1;
$var = "\$hashref->{$var}" if ! grep {$var eq $_} #FORMULA_FUNCS;
$var
}eg;
}
$conditions .= "return $formula if $cond; ";
}
my $code = "sub {my \$hashref = shift; ${conditions} return; }";
my $sub = eval $code;
if ($#) {
die "Invalid formulas in $name: $#";
}
$formula_per_k{$name} = $sub;
}
}
sub formula_per_k {
my ($k, $vars) = #_;
die "Unrecognized K value '$k'" if ! exists $formula_per_k{$k};
return $formula_per_k{$k}($vars);
}
print "'K1', {d => .1} = " . formula_per_k('K1', {d => .1}) . "\n";
print "'K1', {d => .05} = " . formula_per_k('K1', {d => .05}) . "\n";
print "'K3', {d => .02} = " . formula_per_k('K3', {d => .02}) . "\n";
print "'K3', {d => .021} = " . formula_per_k('K3', {d => .021}) . "\n";
__DATA__
... #OTHER STUFFS
K K1 {
LOL {
COND { d < 0.01 }
FORMULA { -0.2 + 3.3*sqrt(d) }
COND { d >= 0.01 }
FORMULA { -0.2 + 3.3*sqrt(d+0.4) }
}
}
... #OTHER STUFFS
K K2 {
LOL {
COND { d < 0.03 }
FORMULA { -2.2 + 1.3*sqrt(d) }
COND { d >= 0.03 }
FORMULA { -2.2 + 1.3*sqrt(d+0.8) }
}
}
... #OTHER STUFFS
K K3 {
LOL {
COND { d < 0.02 }
FORMULA { -4.3 + 0.3*sqrt(d) }
COND { d >= 0.02 }
FORMULA { -4.3 + 0.3*sqrt(d+0.3) }
}
}
... #OTHER STUFF
Outputs:
'K1', {d => .1} = 2.13345237791561
'K1', {d => .05} = 2.01370729772479
'K3', {d => .02} = -4.13029437251523
'K3', {d => .021} = -4.13002941430942
ERROR:
Sequence (?1...) not recognized in regex; marked by <-- HERE in m/
^K \s+ M3 \s* {
( (?: [^{}]+ | {(?2 <-- HERE )} )* ) # Matched braces only.
}
/ at ./code.pl line 215, <RFILE> line 12.
UPDATE:
Code is updated.
This was originally suggested by https://stackoverflow.com/users/1733163/miller
Before the introduction of (?PARNO), we had to use (??{ code }) to create recursive regular expressions. An example can be found in perlre - Extended Patterns.
The following is tested on v5.16.2, v5.20.0, and locally on a v5.8.9 perlbrew:
our $braces_re;
$braces_re = qr{
\{
(?:
(?> [^{}]+ )
|
(??{ $braces_re })
)*
\}
}sx;
# parse FOO block
while (
$data =~ m{
^FOO \s+ (\w+) \s* \{
( (?: [^{}]+ | (??{ $braces_re }) )* ) # Matched braces only.
\}
}mgx
)
{
my $params = $1;
# parse BAR block
next if $params !~ m{
BAR \s* \{
( (?: [^{}]+ | (??{ $braces_re }) )*? ) # Matched braces only.
\}
}mx;
# SOME CODE
}
Note, I intentionally separated out the declaration of the _re variable and its initialization. There are some versions of perl that will let you declare a recursive regular expression in the same statement as the initialization, but v5.8.9 is not one of them.
Also, if you're comfortable altering your original regex more than just dropping in a replacement for (?PARNO) notation, then the above can be reduced to the following. Also confirmed on v5.16.2:
my $braces_re;
$braces_re = qr{
(?:
(?> [^{}]+ )
| # The following is a "postponed" regular subexpression.
\{ (??{ $braces_re }) \} # Deferred execution enables recursive regex
)*
}sx;
# parse FOO block
while ( $data =~ m{^FOO \s+ (\w+) \s* \{ ( $braces_re ) \} }mgx ) {
my $params = $1;
# parse BAR block
next if $params !~ m{BAR \s* \{ ( $braces_re ) \}}mx;
# SOME CODE
}

How to remove strings which do not start or end with specific substring?

Unfortunately, I'm not a regex expert, so I need a little help.
I'm looking for the solution how to grep an array of strings to get two lists of strings which do not start (1) or end (2) with the specific substring.
Let's assume we have an array with strings matching to the following rule:
[speakerId]-[phrase]-[id].txt
i.e.
10-phraseone-10.txt 11-phraseone-3.txt 1-phraseone-2.txt
2-phraseone-1.txt 3-phraseone-1.txt 4-phraseone-1.txt
5-phraseone-3.txt 6-phraseone-2.txt 7-phraseone-2.txt
8-phraseone-10.txt 9-phraseone-2.txt 10-phrasetwo-1.txt
11-phrasetwo-1.txt 1-phrasetwo-1.txt 2-phrasetwo-1.txt
3-phrasetwo-1.txt 4-phrasetwo-1.txt 5-phrasetwo-1.txt
6-phrasetwo-3.txt 7-phrasetwo-10.txt 8-phrasetwo-1.txt
9-phrasetwo-1.txt 10-phrasethree-10.txt 11-phrasethree-3.txt
1-phrasethree-1.txt 2-phrasethree-11.txt 3-phrasethree-1.txt
4-phrasethree-3.txt 5-phrasethree-1.txt 6-phrasethree-3.txt
7-phrasethree-1.txt 8-phrasethree-1.txt 9-phrasethree-1.txt
Let's introduce variables:
$speakerId
$phrase
$id1, $id2
I would like to grep a list and obtain an array:
with elements which contain specific $phrase but we exclude those strigns which simultaneously start with specific $speakerId AND end with one of specified id's (for instance $id1 or $id2)
with elements which have specific $speakerId and $phrase but do NOT contain one of specific ids at the end (warning: remember to not exclude the 10 or 11 for $id=1 , etc.)
Maybe someone coulde use the following code to write the solution:
#AllEntries = readdir(INPUTDIR);
#Result1 = grep(/blablablahere/, #AllEntries);
#Result2 = grep(/anotherblablabla/, #AllEntries);
closedir(INPUTDIR);
Assuming a basic pattern to match your example:
(?:^|\b)(\d+)-(\w+)-(?!1|2)(\d+)\.txt(?:\b|$)
Which breaks down as:
(?:^|\b) # starts with a new line or a word delimeter
(\d+)- # speakerid and a hyphen
(\w+)- # phrase and a hyphen
(\d+) # id
\.txt # file extension
(?:\b|$) # end of line or word delimeter
You can assert exclusions using negative look-ahead. For instance, to include all matches that do not have the phrase phrasetwo you can modify the above expression to use a negative look-ahead:
(?:^|\b)(\d+)-(?!phrasetwo)(\w+)-(\d+)\.txt(?:\b|$)
Note how I include (?!phrasetwo). Alternatively, you find all phrasethree entries that end in an even number by using a look-behind instead of a look-ahead:
(?:^|\b)(\d+)-phrasethree-(\d+)(?<![13579])\.txt(?:\b|$)
(?<![13579]) just makes sure the last number of the ID falls on an even number.
It sounds a bit like you're describing a query function.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
my ( $set_a, $set_b ) = query( 2, 'phrasethree', [ 1, 3 ] );
print Dumper( { a => $set_a, b => $set_b } );
# a) fetch elements which
# 1. match $phrase
# 2. exclude $speakerId
# 3. match #ids
# b) fetch elements which
# 1. match $phrase
# 2. match $speakerId
# 3. exclude #ids
sub query {
my ( $speakerId, $passPhrase, $id_ra ) = #_;
my %has_id = map { ( $_ => 0 ) } #{$id_ra};
my ( #a, #b );
while ( my $filename = glob '*.txt' ) {
if ( $filename =~ m{\A ( \d+ )-( .+? )-( \d+ ) [.] txt \z}xms ) {
my ( $_speakerId, $_passPhrase, $_id ) = ( $1, $2, $3 );
if ( $_passPhrase eq $passPhrase ) {
if ( $_speakerId ne $speakerId
&& exists $has_id{$_id} )
{
push #a, $filename;
}
if ( $_speakerId eq $speakerId
&& !exists $has_id{$_id} )
{
push #b, $filename;
}
}
}
}
return ( \#a, \#b );
}
I like the approach with pure regular expressions using negative lookaheads and -behinds. However, it's a little bit hard to read. Maybe code like this could be more self-explanatory. It uses standard perl idioms that are readable like english in some cases:
my #all_entries = readdir(...);
my #matching_entries = ();
foreach my $entry (#all_entries) {
# split file name
next unless /^(\d+)-(.*?)-(\d+).txt$/;
my ($sid, $phrase, $id) = ($1, $2, $3);
# filter
next unless $sid eq "foo";
next unless $id == 42 or $phrase eq "bar";
# more readable filter rules
# match
push #matching_entries, $entry;
}
# do something with #matching_entries
If you really want to express something that complex in a grep list transformation, you could write code like this:
my #matching_entries = grep {
/^(\d)-(.*?)-(\d+).txt$/
and $1 eq "foo"
and ($3 == 42 or $phrase eq "bar")
# and so on
} readdir(...)

How to automagically create pattern based on real data?

I have many vendors in database, they all differ in some aspect of their data. I'd like to make data validation rule which is based on previous data.
Example:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
Goal: if user inputs string 'XZ-217' for vendor B, algorithm should compare previous data and say: this string is not similar to vendor B previous data.
Is there some good way/tools to achieve such comparison? Answer could be some generic algoritm or Perl module.
Edit:
The "similarity" is hard to define, i agree. But i'd like to catch to algorithm, which could analyze previous ca 100 samples and then compare the outcome of analyze with new data. Similarity may based on length, on use of characters/numbers, string creation patterns, similar beginning/end/middle, having some separators in.
I feel it is not easy task, but on other hand, i think it has very wide use. So i hoped, there is already some hints.
You may want to peruse:
http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm (for instance)
Joel and I came up with similar ideas. The code below differentiates 3 types of zones.
one or more non-word characters
alphanumeric cluster
a cluster of digits
It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.
use strict;
use warnings;
use List::Util qw<max min>;
sub compile_search_expr {
shift;
#_ = #{ shift() } if #_ == 1;
my $str
= join( '|'
, map { join( ''
, grep { defined; }
map {
$_ eq 'P' ? quotemeta;
: $_ eq 'W' ? "\\w{$_->[1],$_->[2]}"
: $_ eq 'D' ? "\\d{$_->[1],$_->[2]}"
: undef
;
} #$_
)
} #_ == 1 ? #{ shift } : #_
);
return qr/^(?:$str)$/;
}
sub merge_profiles {
shift;
my ( $profile_list, $new_profile ) = #_;
my $found = 0;
PROFILE:
for my $profile ( #$profile_list ) {
my $profile_length = #$profile;
# it's not the same profile.
next PROFILE unless $profile_length == #$new_profile;
my #merged;
for ( my $i = 0; $i < $profile_length; $i++ ) {
my $old = $profile->[$i];
my $new = $new_profile->[$i];
next PROFILE unless $old->[0] eq $new->[0];
push( #merged
, [ $old->[0]
, min( $old->[1], $new->[1] )
, max( $old->[2], $new->[2] )
]);
}
#$profile = #merged;
$found = 1;
last PROFILE;
}
push #$profile_list, $new_profile unless $found;
return;
}
sub compute_info_profile {
shift;
my #profile_chunks
= map {
/\W/ ? [ P => $_ ]
: /\D/ ? [ W => length, length ]
: [ D => length, length ]
}
grep { length; } split /(\W+)/, shift
;
}
# Psuedo-Perl
sub process_input_task {
my ( $application, $input ) = #_;
my $patterns = $application->get_patterns_for_current_customer;
my $regex = $application->compile_search_expr( $patterns );
if ( $input =~ /$regex/ ) {}
elsif ( $application->approve_divergeance( $input )) {
$application->merge_profiles( $patterns, compute_info_profile( $input ));
}
else {
$application->escalate(
Incident->new( issue => INVALID_FORMAT
, input => $input
, customer => $customer
));
}
return $application->process_approved_input( $input );
}
Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.
output:
A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})
code:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw'uniq each_arrayref';
my %examples = (
A => [qw/ XZ-4 XZ-23 XZ-217 /],
B => [qw/ 1276 1899 22711 /],
C => [qw/ 12-4 12-75 12 /],
);
foreach my $example (sort keys %examples) {
print "$example: ", gen_regex(#{ $examples{$example} }) || "Generate failed!", "\n";
}
sub gen_regex {
my #cases = #_;
my %exploded;
# ex. $case may be XZ-217
foreach my $case (#cases) {
my #parts =
grep { defined and length }
split( /(\d+|\w+)/, $case );
# #parts are ( XZ, -, 217 )
foreach (#parts) {
if (/\d/) {
# 217 becomes ['\d' => 3]
push #{ $exploded{$case} }, ['\d' => length];
} elsif (/\w/) {
#XZ becomes ['\w' => 2]
push #{ $exploded{$case} }, ['\w' => length];
} else {
# - becomes ['lit' => '-']
push #{ $exploded{$case} }, ['lit' => $_ ];
}
}
}
my $pattern = '';
# iterate over nth element (part) of each case
my $ea = each_arrayref(values %exploded);
while (my #parts = $ea->()) {
# remove undefined (i.e. optional) parts
my #def_parts = grep { defined } #parts;
# check that all (defined) parts are the same type
my #part_types = uniq map {$_->[0]} #def_parts;
if (#part_types > 1) {
warn "Parts not aligned\n";
return;
}
my $type = $part_types[0]; #same so make scalar
# were there optional parts?
my $required = (#parts == #def_parts);
# keep the values of each part
# these are either a repitition or lit strings
my #values = sort uniq map { $_->[1] } #def_parts;
# these are for non-literal quantifiers
my $min = $required ? $values[0] : 0;
my $max = $values[-1];
# write the specific pattern for each type
if ($type eq '\d') {
$pattern .= '\d' . "{$min,$max}";
} elsif ($type eq '\w') {
$pattern .= '\w' . "{$min,$max}";
} elsif ($type eq 'lit') {
# quote special characters, - becomes \-
my #uniq = map { quotemeta } uniq #values;
# join with alternations, surround by non-capture grouup, add quantifier
$pattern .= '(?:' . join('|', #uniq) . ')' . ($required ? '{1}' : '?');
}
}
# build the qr regex from pattern
my $regex = qr/$pattern/;
# test that all original patterns match (#fail should be empty)
my #fail = grep { $_ !~ $regex } #cases;
if (#fail) {
warn "Some cases fail for generated pattern $regex: (#fail)\n";
return '';
} else {
return $regex;
}
}
To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.
If there was a Tie::StringApproxHash module, it would fit the bill here.
I think you're looking for something that combines the fuzzy-logic functionality of String::Approx and the hash interface of Tie::RegexpHash.
The former is more important; the latter would make light work of coding.

Finding results from and between groups of parentheses with regexp

Text format:
(Superships)
Eirik Raude - olajkutató fúrósziget
(Eirik Raude - Oil Patch Explorer)
I need regex to match text beetween first set of parentheses. Results: text1.
I need regex to match text beetween first set of parentheses and second set of parentheses. Results: text2.
I need regex to match text beetween second set of parentheses. Results: text3.
text1: Superships, represent english title,
text2: Eirik Raude - olajkutató fúrósziget, represent hungarian subtitle,
text3: Eirik Raude - Oil Patch Explorer, represent english subtitle.
I need regex for perl script to match this title and subtitle. Example script:
($anchor) = $tree->look_down(_tag=>"h1", class=>"blackbigtitle");
if ($anchor) {
$elem = $anchor;
my ($engtitle, $engsubtitle, $hunsubtitle #tmp);
while (($elem = $elem->right()) &&
((ref $elem) && ($elem->tag() ne "table"))) {
#tmp = get_all_text($elem);
push #lines, #tmp;
$line = join(' ', #tmp);
if (($engtitle) = $line =~ m/**regex need that return text1**/) {
push #{$prog->{q(title)}}, [$engtitle, 'en'];
t "english-title added: $engtitle";
}
elsif (($engsubtitle) = $line =~ m/**regex need that return text3**/) {
push #{$prog->{q(sub-title)}}, [$subtitle, 'en'];
t "english_subtitle added: $engsubtitle";
}
elsif (($hunsubtitle) = $line =~ m/**regex need that return text2**/) {
push #{$prog->{q(hun-subtitle)}}, [$hunsubtitle, 'hu'];
t "hungarinan_subtitle added: $hunsubtitle";
}
}
}
Considering your comment, you can do something like :
if (($english_title) = $line =~ m/^\(([^)]+)\)$/) {
$found_english_title = 1;
# do stuff
} elsif (($english-subtitle) = $line =~ m/^([^()]+)$/) {
# do stuff
} elsif ($found_english_title && ($hungarian-title) = $line =~ m/^\(([^)]+)\)$/) {
# do stuff
}
If you need to match them all in one expression:
\(([^)]+)\)([^(]+)\(([^)]+)\)
This matches (, then anything that's not ), then ), then anything that's not (, then, (, ... I think you get the picture.
First group will be text1, second group will be text2, third group will be text3.
You can also just make a more generix regex that matches something like "(text1)", "(text1)text2(text3)" or "text1(text2)" when applied several times:
(?:^|[()])([^()])(?:[()]|$)
This matches the beginning of the string or ( or ), then characters that are not ( or ), then ( or ) or the end of the string. :? is for non-capturing group, so the first group will have the string. Something more complex is necessary to match ( with ) every time, i.e., it can match "(text1(".

Regular expression problem

what's the regex for get all match about:
IF(.....);
I need to get the start and the end of the previous string: the content can be also ( and ) and then can be other (... IF (...) ....)
I need ONLY content inside IF.
Any idea ?
That's because, I need to get an Excel formula (if condition) and transforms it to another language (java script).
EDIT:
i tried
`/IF\s*(\(\s*.+?\s*\))/i or /IF(\(.+?\))/`
this doesn't work because it match only if there aren't ) or ( inside 'IF(...)'
I suspect you have a problewm that is not suitable for regex matching. You want to do unbounded counting (so you can match opening and closing parentheses) and this is more than a regexp can handle. Hand-rolling a parser to do the matching you want shouldn't be hard, though.
Essentially (pseudo-code):
Find "IF"
Ensure next character is "("
Initialise counter parendepth to 1
While parendepth > 0:
place next character in ch
if ch == "(":
parendepth += 1
if ch == ")":
parendepth -= 1
Add in small amounts of "remember start" and "remember end" and you should be all set.
This is one way to do it in Perl. Any regex flavor that allows recursion
should have this capability.
In this example, the fact that the correct parenthesis are annotated
(see the output) and balanced, means its possible to store the data
in a structured way.
This in no way validates anything, its just a quick solution.
use strict;
use warnings;
##
$/ = undef;
my $str = <DATA>;
my ($lvl, $keyword) = ( 0, '(?:IF|ELSIF)' ); # One or more keywords
# (using 2 in this example)
my $kwrx = qr/
(\b $keyword \s*) #1 - keword capture group
( #2 - recursion group
\( # literal '('
( #3 - content capture group
(?:
(?> [^()]+ ) # any non parenth char
| (?2) # or, recurse group 2
)*
)
\) # literal ')'
)
| ( (?:(?!\b $keyword \s*).)+ ) #4
| ($keyword) #5
/sx;
##
print "\n$str\n- - -\n";
findKeywords ( $str );
exit 0;
##
sub findKeywords
{
my ($str) = #_;
while ($str =~ /$kwrx/g)
{
# Process keyword(s), recurse its contents
if (defined $2) {
print "${1}[";
$lvl++;
findKeywords ( $3 );
}
# Process non-keyword text
elsif (defined $4) {
print "$4";
}
elsif (defined $5) {
print "$5";
}
}
if ($lvl > 0) {
print ']';
$lvl--;
}
}
__DATA__
IF( some junk IF (inner meter(s)) )
THEN {
IF ( its in
here
( IF (a=5)
ELSIF
( b=5
and IF( a=4 or
IF(its Monday) and there are
IF( ('lots') IF( ('of') IF( ('these') ) ) )
)
)
)
then its ok
)
ELSIF ( or here() )
ELSE (or nothing)
}
Output:
IF( some junk IF (inner meter(s)) )
THEN {
IF ( its in
here
( IF (a=5)
ELSIF
( b=5
and IF( a=4 or
IF(its Monday) and there are
IF( ('lots') IF( ('of') IF( ('these') ) ) )
)
)
)
then its ok
)
ELSIF ( or here() )
ELSE (or nothing)
}
- - -
IF[ some junk IF [inner meter(s)] ]
THEN {
IF [ its in
here
( IF [a=5]
ELSIF
[ b=5
and IF[ a=4 or
IF[its Monday] and there are
IF[ ('lots') IF[ ('of') IF[ ('these') ] ] ]
]
]
)
then its ok
]
ELSIF [ or here() ]
ELSE (or nothing)
}
Expanding on Paolo's answer, you might also need to worry about spaces and case:
/IF\s*(\(\s*.+?\s*\))/i
This should work and capture all the text between parentheses, including both parentheses, as the first match:
/IF(\(.+?\))/
Please note that it won't match IF() (empty parentheses): if you want to match empty parentheses too, you can replace the + (match one or more) with an * (match zero or more):
/IF(\(.*?\))/
--- EDIT
If you need to match formulas with parentheses (besides the outmost ones) you can use
/IF(\(.*\))/
which will make the regex "not greedy" by removing the ?. This way it will match the longest string possible. Sorry, I assumed wrongly that you did not have any sub-parentheses.
It's not possible only using regular expressions. If you are or can use .NET you should look in to using Balanced Matching.