How can I extract blocks from this configuration file using Perl? - regex

I am trying to search through a Load Balancer config and extract some data. The configuration file looks like this
pool {
name "POOL_name1"
ttl 30
monitor all "tcp"
preferred rr
partition "Common"
member 12.24.5.100:80
}
pool {
name "Pool-name2"
ttl 30
monitor all "https_ignore_dwn"
preferred rr
fallback rr
partition "Common"
member 69.241.25.121:8443
member 69.241.25.122:8443
}
I am trying to assign each pool config to it's own array, so I can loop through the array to find specific IP addresses and pool names. I tried the following regex, but its not working.
my #POOLDATA = <FILE>;
close FILE;
foreach (#POOLDATA) {
if (/^pool\s\{\s/ .. /^\}\s/) {
push (#POOLCONFIG, "$_");
}
}
Does anyone have a suggestion on how to separate each pool config into its own array? (or a better suggestion) Thank you in advance for your help

#!/usr/bin/env perl
use warnings; use strict;
my #pools;
my $keys = join('|', sort
'name',
'ttl',
'monitor all',
'preferred',
'partition',
'member'
);
my $pat = qr/^($keys)\s+([^\n]+)\n\z/;
while ( my $line = <DATA> ) {
if ($line =~ /^pool\s+{/ ) {
push #pools, {},
}
elsif (my ($key, $value) = ($line =~ $pat)) {
$value =~ s/^"([^"]+)"\z/$1/;
push #{ $pools[-1]->{$key} }, $value;
}
}
use Data::Dumper;
print Dumper \#pools;
__DATA__
pool {
name "POOL_name1"
ttl 30
monitor all "tcp"
preferred rr
partition "Common"
member 12.24.5.100:80
}
pool {
name "Pool-name2"
ttl 30
monitor all "https_ignore_dwn"
preferred rr
fallback rr
partition "Common"
member 69.241.25.121:8443
member 69.241.25.122:8443
}
Output:
$VAR1 = [
{
'monitor all' => [
'tcp'
],
'member' => [
'12.24.5.100:80'
],
'ttl' => [
'30'
],
'name' => [
'POOL_name1'
],
'preferred' => [
'rr'
],
'partition' => [
'Common'
]
},
{
'monitor all' => [
'https_ignore_dwn'
],
'member' => [
'69.241.25.121:8443',
'69.241.25.122:8443'
],
'ttl' => [
'30'
],
'name' => [
'Pool-name2'
],
'preferred' => [
'rr'
],
'partition' => [
'Common'
]
}
];
Edit:
Of course, you can check for a member element, and fill in a default one if it isn't found. In fact, with the basic structure in place, you should have been able to do that yourself.
One way to do it is to check for the end of a pool record:
while ( my $line = <DATA> ) {
if ($line =~ /^pool\s+{/ ) {
push #pools, {},
}
elsif (my ($key, $value) = ($line =~ $pat)) {
$value =~ s/^"([^"]+)"\z/$1/;
push #{ $pools[-1]->{$key} }, $value;
}
elsif ($line =~ /^\s*}/) {
my $last = $pools[-1];
if ($last and not $last->{member}) {
$last->{member} = [ qw(0.0.0.0) ];
}
}
}

As Sinan Unur recommended, you can store a reference to a hash in your array. That way, each element of your array is a hash.
By the way, Sinan's data structure is a bit more complex: You have an array of pools. Each pool is a hash with a key that's the value of the name of the pool element, and a reference to an array. This way, each element in the pool could have multiple values (like your IP addresses do).
My only comment is that I might use a hash for storing the pools, and key it by IP address. That is, assuming that an IP address is unique to a particular pool. That way, you can easily pull up a pool by the IP address without having to search. I would also keep a parallel structure by pool name for the same reason. (And, since each pool is a reference, storing the pool by both IP address and name wouldn't take up that much extra memory. And, updating one would update the other automatically).
If you're not familiar with Perl references, or how to create an array or hashes, or a hash of arrays, you can take a look at the following Perl tutorials:
perlretut - Mark's very short tutorial about references
perldsc - Perl Data Structures Cookbook
Once you get the hang of using multilayered Perl structures, you can quickly learn how to use object oriented design in your Perl scripts, and make maintaining these structures very easy to do.

Just another way to look at it. This one handles the multiple member fields, specially.
use strict;
use warnings;
use Data::Dumper;
use English qw<$RS>;
use List::MoreUtils qw<natatime>;
use Params::Util qw<_ARRAY _CODE>;
# Here, we rig the record separator to break on \n}\n
local $RS = "\n}\n";
# Here, we standardize a behavior with hash duplicate keys
my $TURN_DUPS_INTO_ARRAYS = sub {
my ( $hr, $k, $ov, $nv ) = #_;
if ( _ARRAY( $ov )) {
push #{ $ov }, $nv;
}
else {
$h->{ $k } = [ $ov, $nv ];
}
};
# Here is a generic hashing routine
# Most of the work is figuring out how the user wants to store values
# and deal with duplicates
sub hash {
my ( $code, $param_name, $store_op, $on_duplicate );
while ( my ( $peek ) = #_ ) {
if ( $code = _CODE( $peek )) {
last unless $param_name;
if ( $param_name eq 'on_dup' ) {
$on_duplicate = shift;
}
elsif ( $param_name eq 'store' ) {
$store_op = shift;
}
else {
last;
}
undef $code;
}
else {
my #c = $peek =~ /^-?(on_dup|store$)/;
last unless $param_name = $c[0];
shift;
}
}
$store_op ||= sub { $_[0]->{ $_[1] } = $_[3]; };
$on_duplicate ||= $code || $store_op;
my %h;
while ( #_ ) {
my $k = shift;
next unless defined( my $v = shift );
(( exists $h{ $k } and $on_duplicate ) ? $on_duplicate
: $store_op
)->( \%h, $k, $h{ $k }, $v )
;
}
return wantarray ? %h : \%h;
}
my %pools;
# So the loop is rather small
while ( <DATA> ) {
# remove pool { ... } brackets
s/\A\s*pool\s+\{\s*\n//smx;
s/\n\s*\}\n*//smx;
my $h
= hash( -on_duplicate => $TURN_DUPS_INTO_ARRAYS
, map { s/"$//; s/\s+$//; $_ }
map { split /\s+"|\s{2,}/msx, $_, 2 }
split /\n/m
);
$pools{ $h->{name} } = $h;
}
print Dumper( \%pools );
### %pools
__DATA__
pool {
name "POOL_name1"
ttl 30
monitor all "tcp"
preferred rr
partition "Common"
member 12.24.5.100:80
}
pool {
name "Pool-name2"
ttl 30
monitor all "https_ignore_dwn"
preferred rr
fallback rr
partition "Common"
member 69.241.25.121:8443
member 69.241.25.122:8443
}
Just a note about the hash function, I had noticed a high number of posts recently about hashes that handle duplicates. This is a general solution.

Related

Searching Perl array with regex and return single capturing group only

I have a Perl script in which I perform web service calls in a loop. The server returns a multivalued HTTP header that I need to parse after each call with information that I will need to make the next call (if it doesn't return the header, I want to exit the loop).
I only care about one of the values in the header, and I need to get the information out of it with a regular expression. Let's say the header is like this, and I only care about the "foo" value:
X-Header: test-abc12345; blah=foo
X-Header: test-fgasjhgakg; blah=bar
I can get the header values like this: #values = $response->header( 'X-Header' );. But how do I quickly check if
There is a foo value, and
Parse and save the foo value for the next iteration?
Ideally, I'd like to do something like this:
my $value = 'default';
do {
# (do HTTP request; use $value)
#values = $response->header( 'X-Header' );
} while( $value = first { /(?:test-)([^;]+)(?:; blah=foo)/ } #values );
But grep, first (from List::Util), etc. all return the entire match and not just the single capturing group I want. I want to avoid cluttering up my code by looping over the array and matching/parsing inside the loop body.
Is what I want possible? What would be the most compact way to write it? So far, all I can come up with is using lookarounds and \K to discard the stuff I don't care about, but this isn't super readable and makes the regex engine perform a lot of unnecessary steps.
So it seems that you want to catch the first element with a certain pattern, but acquire only the pattern. And you want it done nicely. Indeed, first and grep only pass the element itself.
However, List::MoreUtils::first_result does support processing of its match
use List::MoreUtils 0.406 qw(first_result);
my #w = qw(a bit c dIT); # get first "it" case-insensitive
my $res = first_result { ( /(it)/i )[0] } #w;
say $res // 'undef'; #--> it
That ( ... )[0] is needed to put the regex in the list context so that it returns the actual capture. Another way would be firstres { my ($r) = /(it)/i; $r }. Pick your choice
For the data in the question
use warnings;
use strict;
use feature 'say';
use List::MoreUtils 0.406 qw(firstres);
my #data = (
'X-Header: test-abc12345; blah=foo',
'X-Header: test-fgasjhgakg; blah=bar'
);
if (my $r = firstres { ( /test-([^;]+);\s+blah=foo/ )[0] } #data) {
say $r
}
Prints abc12345, clarified in a comment to be the sought result.
Module versions prior to 0.406 (of 2015-03-03) didn't have firstres (alias first_result)
first { ... } #values returns one the values (or undef).
You could use either of these:
my ($value) = map { /...(...).../ } #values;
my $value = ( map { /...(...).../ } #values ) ? $1 : undef;
my $value = ( map { /...(...).../ } #values )[0];
Using first, it would look like the following, which is rather silly:
my $value = first { 1 } map { /...(...).../ } #values;
However, assuming the capture can't be an empty string or the string 0, List::MoreUtils's first_result could be used to avoid the unnecessary matches:
my $value = first_result { /...(...).../ ? $1 : undef } #values;
my $value = first_result { ( /...(...).../ )[0] } #values;
If the returned value can be false (e.g. an empty string or a 0) you can use something like
my $value = first_result { /...(...).../ ? \$1 : undef } #values;
$value = $$value if $value;
The first_result approach isn't necessarily faster in practice.
Following code snippet is looking for foo stored in a variable $find, the found values is stored in variable $found.
my $find = 'foo';
my $found;
while( $response->header( 'X-Header' ) ) {
if( /X-Header: .*?blah=($find)/ ) {
$found = $1;
last;
}
}
say $found if $found;
Sample demo code
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $find = 'foo';
my $found;
my #header = <DATA>;
chomp(#header);
for ( #header ) {
$found = $1 if /X-Header: .*?blah=($find)/;
last if $found;
}
say Dumper(\#header);
say "Found: $found" if $found;
__DATA__
X-Header: test-abc12345; blah=foo
X-Header: test-fgasjhgakg; blah=bar
Output
$VAR1 = [
'X-Header: test-abc12345; blah=foo',
'X-Header: test-fgasjhgakg; blah=bar'
];
Found: foo

Parsing data with perl- capturing a range of text

I'm writing code to parse all the interfaces on my network, looking for certain configurations.. etc.
the data looks like this:
Interface fa1
mode access
port-security
mac-address sticky
!
interface fa2
mode trunk
!
Basically starting with "^interface " and ending "!".
my current algorithm is to "record" the data I need
foreach $line (#input) {
if ( $line =~ m/^interface.+\d/ && $line !~ m/interface Embedded-Service-Engine|BRI|TenGigabitEthernet|vlan|Port-channel|ATM|loopback/i) {
$record = 1;
}
#$int ne '' is to handle the rest of the file not in this format
if( $line =~ m/!/ && $int ne '') {
#save data in format 'interface fa2,mode trunk'
#if the interface doesn't have port-security
push(#intlist, join(','split("\r\n",$int))."\n") unless $int =~ m/port-security/;
$record=0;
$int='';
}
if ($record) {
$int.=$line;
}
}
while this works in my case, I'd like a simply way to do it. I've searched and found that you can use the range operator '..' on regex
which turns my code into :
#input # contains the file
#interfaces = grep (/^interface.+\d/ .. /!/, #input);
which gives me all the interface data, the problem is now every line is a single element in the #interfaces array. how can I then split this data up so everything from /^interface.+\d/ .. /!/ is one element in this array without creating more for loops?
The goal is to get it down to one element so I can then scan it for interfaces I don't want to look at interface Embedded-Service-Engine|BRI|TenGigabit as well as interfaces that have the correct configurations.
Have a look at $/ because I think that'll help. It's the record separator - which defaults to \n.
Then you can apply regular expressions to the current 'chunk' to pull out the data you require - by default a regular expression/capture group applies to $_ the implicit variable.
E.g.
#!/usr/bin/perl
use strict;
use warnings;
local $/ = '!';
while ( <DATA> ) {
my ( $interface ) = m/Interface (\w+)/i;
next if $interface =~ m/Embedded-Service-Engine/;
my ( $mode ) = m/mode (\w+)/;
print "$interface $mode\n";
print "---next record---\n";
}
__DATA__
Interface fa1
mode access
port-security
mac-address sticky
!
interface fa2
mode trunk
!
If you need to keep the data for other uses (e.g. 'process as you go' isn't suitable) then the tool for the job is a hash.
You can either use something like the above - and populate the hash with particular keys you're interested in - or use the magic of map to do it for you.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
local $/ = '!';
my %interfaces;
while ( <DATA> ) {
my ( $interface ) = m/Interface (\w+)/i;
next if $interface =~ m/Embedded-Service-Engine/;
my %interface_values = map { my ( $key, $value ) = split; $key, $value || 1 } grep { /\w/ } split ( "\n" );
$interfaces{$interface} = \%interface_values;
}
print Dumper \%interfaces
__DATA__
Interface fa1
mode access
port-security
mac-address sticky
!
interface fa2
mode trunk
!
That map line basically:
splits the current record on \n to get each line.
filters 'not word' values (so blank lines and !)
splits each line on whitepace, to get a key and value pair.
If no value is defined, sets it to 1. (so in the example, port-security )
Populates a hash with these key-value pairs.
and then updates %interfaces with the hash for each interface ID.
Giving something like:
$VAR1 = {
'fa1' => {
'port-security' => 1,
'mode' => 'access',
'Interface' => 'fa1',
'mac-address' => 'sticky'
},
'fa2' => {
'mode' => 'trunk',
'interface' => 'fa2'
}
};
A hash or hashref would be a result where you can work with. Furthermore, reading records based on a fixed structure can be read using a matching regex. Like so:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
our %MATCH;
*MATCH = \%+;
# read file into variable
my ( $file, $data ) = ( 'interfaces.txt', undef );
open( my $fh, '<', $file ) or die "cannot open file $file";
{
local $/;
$data = <$fh>;
}
close($fh);
print Dumper $data;
my $regex = qr{
(?sm)
interface [^\w\n]+
(?<interface> (\w[^\n]+))
[^\w]+
mode [^\w]+
(?<mode> (\w[^\n]+))
[^\w]+
((?<portsecurity> port-security)
[^\w]+)? # port-security is optional
(mac-address [^\w]+
(?<macaddress> (\w[^\n]+))
)? # mac-address is optional
[^!]*
!
}x;
my $results = {};
while ( $data =~ m/$regex/g ) {
my $interface = $MATCH{interface};
$results->{$interface} = { mode => $MATCH{mode} ? $MATCH{mode} : '' };
$results->{$interface}->{'port-security'} = 1
if defined $MATCH{portsecurity};
$results->{$interface}->{macaddress} = $MATCH{macaddress}
if defined $MATCH{macaddress};
}
print Dumper $results;
The result from your input is:
$VAR1 = {
'fa1' => {
'macaddress' => 'sticky',
'mode' => 'access',
'port-security' => 1
},
'fa2' => {
'mode' => 'trunk'
}
};
Having a hash with the interface names as key values, gives you the opportunity to use a 'grep' for the interfaces you want.
If your structure is not fixed - there is no ordering in your fields mode, port-security, mac-address - then you would need to read a interface record in one go, and split up the fields using separate regexes for each field.
This is my Final solution. In this particular case I'm searching for all switchports that have a maximum port-security not equal to 1. This is just an example and can be switched for any configuration. I'm also omitting certain interfaces from being caught if that configuration is actually applied to them.
#!/usr/bin/perl
$MDIR='/currentConfig';
#list of interfaces you don't want to see to filter output
#omit =(
'MANAGEMENT.PORT',
'sup.mgmt',
'Internal.EtherSwitch',
'Router',
'ip address \d',
'STRA'
);
#join with '|' to form the regex
$dontwant = join('|',#omit);
#search criteria
$search='switchport port-security maximum [^1]';
opendir(DIR,$MDIR) or die $!;
#dirContents=readdir DIR;close DIR;
foreach $file (#dirContents) {
open(IN,$MDIR.'/'.$file) or die $!;
#record seperator to !
$/='!';
my #inFile=<IN>; close IN;
#since the record seperator has been changed, '^' won't match beginning of line
my #ints = grep (/\ninterface/i,#inFile);
#set record seperator back to normal
$/="\n";
foreach $int (#ints) {
if ( $int =~ m/$search/i && $int !~ m/$dontwant/) {
push(#finalint,$int);
}
}
}
#just list the interfaces found, i'll use this to make it comma seperated
foreach $elem (#finalint) {
print $elem;
}

perl function to parse the command output

I have written a perl function that executes a command and parses the command output. The command will provide the output mentioned below. In the function call i pass a number between 1 to 5 as arguments . '1' correspond to lan, 2 corresponds to wan, '3' corresponds to name and so on. ( see the output below) . for example if 1 is passed as an argument in the function call the expected output is '0' ( 1 = lan and value of lan = 0 ) . When i execute the script im not getting the expected output. null is returned.Any suggestions?
command output 1 :
[
{
"lan" : 0, #1
"wan" : 0, #2
"name" : "XYZ", #3
"packets" : 0, #4
"bytes" : 0 #5
}
]
Function call
$self->{'stats_obj'} = Statistics->new( ip => "ip addr")
my $result = $self->{'stats_obj'}->statistics_get('1');
INFO('statistics:' . $result );
Function:
sub statistics_get{
my ($self, $option)= #_;
my $result = $self->_get_hyd_tc();
return $result->{$option};
}
sub _get_hyd_tc {
my ($self) = #_;
my $opt;
my %result;
my $line;
my $cmd = 'cmd goes here';
$self->execute($cmd);
my $count =0;
foreach my $line ( $self->output() ) {
chomp $line;
if ( $line =~ /(Requested table doesn't.*)/i ){
ERROR('table doesnt exist' . $line)
}
if ($line =~ /(.*)/) {
$opt = $1;
$count = 0;
}
elsif ( $line =~ /(.*)/) {
my $key = $1;
my $value = $2;
$result{$opt}{++$count} = $value;
}
}
return \%result;
}
You are approaching this the wrong way.
Your snippet you give is JSON. So really - your best bet by far is to process it as JSON and not try and parse it yourself.
Something like this:
use strict;
use warnings;
use JSON;
use Data::Dumper;
my $command_output = '[
{
"lan" : 0,
"wan" : 0,
"name" : "XYZ",
"packets" : 0,
"bytes" : 0
}
]';
my $json_ob = decode_json( $command_output );
print Dumper \$json_ob;
print $json_ob -> [0] -> {'name'};
Those [] in your text denotes an array. usually that means multiple elements. You could iterate those elements, but as you've just got one, accessing it via [0] does the trick.
Now, you could if you really want map the attributes from 'words' to numeric, but ... there isn't really any need.
But to answer your question - why is null returned - it's because:
if ($line =~ /(.*)/) {
Will always evaluate as true - zero or more of anything.
And therefore you will never run the second elsif loop and so this never happens:
$result{$opt}{++$count} = $value;
And so you never have anything other than an empty array to return.

Perl substitute strings in hash

Hi I have an perl hash of arbitrary depth. I want to substitute the a string in entire structure with something else.
What is the right approach to do it?
I did something like this
#convert the hash to string for manipulation
my $data = YAML::Dump(%hash);
# do manipulation
---
---
# time to get back the hash
%hash = YAML::Load($data);
Your idea seems very risky to me, since it can be hard to be sure that the substitution won't destroy something in the output of YAML::Dump that will prevent the result from being read back in again, or worse, something that will alter the structure of the hash as it is represented in the dump string. What if the manipulation you are trying to perform is to replace : with , or ’ with ', or something of that sort?
I would probably do something more like this:
use Scalar::Util 'reftype';
# replace $this with $that in key names and string values of $hash
# recursively apply replacement in hash and all its subhashes
sub hash_replace {
my ($hash, $this, $that) = #_;
for my $k (keys %$hash) {
# substitution in value
my $v = $hash->{$k};
if (ref $v && reftype($v) eq "HASH") {
hash_replace($v, $this, $that);
} elsif (! ref $v) {
$v =~ s/$this/$that/og;
}
my $new_hash = {};
for my $k (keys %$hash) {
# substitution in key
(my $new_key = $k) =~ s/$this/$that/og;
$new_hash->{$new_key} = $hash->{$k};
}
%$hash = %$new_hash; # replace old keys with new keys
}
The s/…/…/ replacement I used here may not be appropriate for your task; you should feel free to use something else. For example, instead of strings $this and $that you might pass two functions, $key_change and $val_change which are applied to keys and to values, respectively, returning the modified versions. See the ###### lines below:
use Scalar::Util 'reftype';
# replace $this with $that in key names and string values of $hash
# recursively apply replacement in hash and all its subhashes
sub hash_replace {
my ($hash, $key_change, $val_change) = #_;
for my $k (keys %$hash) {
# substitution in value
my $v = $hash->{$k};
if (ref $v && reftype($v) eq "HASH") {
hash_replace($v, $key_change, $val_change);
} elsif (! ref $v) {
$v = $val_change->($v); #######
}
}
my $new_hash = {};
for my $k (keys %$hash) {
# substitution in key
my $new_key = $key_change->($k); #######
$new_hash->{$new_key} = $hash->{$k};
}
%$hash = %$new_hash;
}
Here's one way to attack it, by recursing through the hash. In this code, you pass in a sub that does whatever you like to each value in the nested hash. This code only modifies the values, not the keys, and it ignores other reference types (ie. scalar refs, array refs) in the nested structure.
#!/usr/bin/perl -w
use Modern::Perl;
## Visit all nodes in a nested hash. Bare-bones.
sub visit_hash
{
my ($start, $sub) = #_;
my #q = ( $start );
while (#q) {
my $hash = pop #q;
foreach my $key ( keys %{$hash} ) {
my $ref = ref($hash->{$key});
if ( $ref eq "" ) { # not a reference
&$sub( $hash->{$key} );
next;
}
if ( $ref eq "HASH" ) { # reference to a nested hash
push #q, $hash->{$key};
next;
}
# ignore other reference types.
}
}
}
The following gives an example of how to use it, replacing e with E in a nested hash:
# Example of replacing a string in all values:
my %hash =
(
a => "fred",
b => "barney",
c => "wilma",
d => "betty",
nest1 =>
{
1 => "red",
2 => "orange",
3 => "green"
},
nest2 =>
{
x => "alpha",
y => "beta",
z => "gamma"
},
);
use YAML::XS;
print "Before:\n";
print Dump( \%hash );
# now replace 'e' with 'E' in all values.
visit_hash( \%hash, sub { $_[0] =~ s/e/E/g; } );
print "After:\n";
print Dump( \%hash );

How to automagically create pattern based on real data?

I have many vendors in database, they all differ in some aspect of their data. I'd like to make data validation rule which is based on previous data.
Example:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
Goal: if user inputs string 'XZ-217' for vendor B, algorithm should compare previous data and say: this string is not similar to vendor B previous data.
Is there some good way/tools to achieve such comparison? Answer could be some generic algoritm or Perl module.
Edit:
The "similarity" is hard to define, i agree. But i'd like to catch to algorithm, which could analyze previous ca 100 samples and then compare the outcome of analyze with new data. Similarity may based on length, on use of characters/numbers, string creation patterns, similar beginning/end/middle, having some separators in.
I feel it is not easy task, but on other hand, i think it has very wide use. So i hoped, there is already some hints.
You may want to peruse:
http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm (for instance)
Joel and I came up with similar ideas. The code below differentiates 3 types of zones.
one or more non-word characters
alphanumeric cluster
a cluster of digits
It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.
use strict;
use warnings;
use List::Util qw<max min>;
sub compile_search_expr {
shift;
#_ = #{ shift() } if #_ == 1;
my $str
= join( '|'
, map { join( ''
, grep { defined; }
map {
$_ eq 'P' ? quotemeta;
: $_ eq 'W' ? "\\w{$_->[1],$_->[2]}"
: $_ eq 'D' ? "\\d{$_->[1],$_->[2]}"
: undef
;
} #$_
)
} #_ == 1 ? #{ shift } : #_
);
return qr/^(?:$str)$/;
}
sub merge_profiles {
shift;
my ( $profile_list, $new_profile ) = #_;
my $found = 0;
PROFILE:
for my $profile ( #$profile_list ) {
my $profile_length = #$profile;
# it's not the same profile.
next PROFILE unless $profile_length == #$new_profile;
my #merged;
for ( my $i = 0; $i < $profile_length; $i++ ) {
my $old = $profile->[$i];
my $new = $new_profile->[$i];
next PROFILE unless $old->[0] eq $new->[0];
push( #merged
, [ $old->[0]
, min( $old->[1], $new->[1] )
, max( $old->[2], $new->[2] )
]);
}
#$profile = #merged;
$found = 1;
last PROFILE;
}
push #$profile_list, $new_profile unless $found;
return;
}
sub compute_info_profile {
shift;
my #profile_chunks
= map {
/\W/ ? [ P => $_ ]
: /\D/ ? [ W => length, length ]
: [ D => length, length ]
}
grep { length; } split /(\W+)/, shift
;
}
# Psuedo-Perl
sub process_input_task {
my ( $application, $input ) = #_;
my $patterns = $application->get_patterns_for_current_customer;
my $regex = $application->compile_search_expr( $patterns );
if ( $input =~ /$regex/ ) {}
elsif ( $application->approve_divergeance( $input )) {
$application->merge_profiles( $patterns, compute_info_profile( $input ));
}
else {
$application->escalate(
Incident->new( issue => INVALID_FORMAT
, input => $input
, customer => $customer
));
}
return $application->process_approved_input( $input );
}
Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.
output:
A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})
code:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw'uniq each_arrayref';
my %examples = (
A => [qw/ XZ-4 XZ-23 XZ-217 /],
B => [qw/ 1276 1899 22711 /],
C => [qw/ 12-4 12-75 12 /],
);
foreach my $example (sort keys %examples) {
print "$example: ", gen_regex(#{ $examples{$example} }) || "Generate failed!", "\n";
}
sub gen_regex {
my #cases = #_;
my %exploded;
# ex. $case may be XZ-217
foreach my $case (#cases) {
my #parts =
grep { defined and length }
split( /(\d+|\w+)/, $case );
# #parts are ( XZ, -, 217 )
foreach (#parts) {
if (/\d/) {
# 217 becomes ['\d' => 3]
push #{ $exploded{$case} }, ['\d' => length];
} elsif (/\w/) {
#XZ becomes ['\w' => 2]
push #{ $exploded{$case} }, ['\w' => length];
} else {
# - becomes ['lit' => '-']
push #{ $exploded{$case} }, ['lit' => $_ ];
}
}
}
my $pattern = '';
# iterate over nth element (part) of each case
my $ea = each_arrayref(values %exploded);
while (my #parts = $ea->()) {
# remove undefined (i.e. optional) parts
my #def_parts = grep { defined } #parts;
# check that all (defined) parts are the same type
my #part_types = uniq map {$_->[0]} #def_parts;
if (#part_types > 1) {
warn "Parts not aligned\n";
return;
}
my $type = $part_types[0]; #same so make scalar
# were there optional parts?
my $required = (#parts == #def_parts);
# keep the values of each part
# these are either a repitition or lit strings
my #values = sort uniq map { $_->[1] } #def_parts;
# these are for non-literal quantifiers
my $min = $required ? $values[0] : 0;
my $max = $values[-1];
# write the specific pattern for each type
if ($type eq '\d') {
$pattern .= '\d' . "{$min,$max}";
} elsif ($type eq '\w') {
$pattern .= '\w' . "{$min,$max}";
} elsif ($type eq 'lit') {
# quote special characters, - becomes \-
my #uniq = map { quotemeta } uniq #values;
# join with alternations, surround by non-capture grouup, add quantifier
$pattern .= '(?:' . join('|', #uniq) . ')' . ($required ? '{1}' : '?');
}
}
# build the qr regex from pattern
my $regex = qr/$pattern/;
# test that all original patterns match (#fail should be empty)
my #fail = grep { $_ !~ $regex } #cases;
if (#fail) {
warn "Some cases fail for generated pattern $regex: (#fail)\n";
return '';
} else {
return $regex;
}
}
To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.
If there was a Tie::StringApproxHash module, it would fit the bill here.
I think you're looking for something that combines the fuzzy-logic functionality of String::Approx and the hash interface of Tie::RegexpHash.
The former is more important; the latter would make light work of coding.