Perl remove characters from string matrix according to pattern - regex

I have multiple strings with the same length stored in a hash data structure. Example:
$VAR1 = {
'first' => 'abcXY',
'second' => 'XYXYa',
'third' => '*abXZ'
};
From this 'matrix' of characters, I would like to remove the 'columns' which exclusively contain the characters X or Y. In the above example, this would be the fourth character of each string (4th 'column').
The desired result would be:
$VAR1 = {
'first' => 'abcY',
'second' => 'XYXa',
'third' => '*abZ'
};
Following code does this by creating a transpose of the values of my hash structure and then determines which indices to keep:
# data structure
my %h = ('first'=>'abcXY', 'second'=>'XYXYa', 'third'=>'*abXZ' );
# get length of all values in hash
my $nchar = length $h{(keys(%h))[0]};
# transpose values of hash
my #transposed = map { my $idx=$_; [map {substr ($_, $idx, 1) } values(%h)] } 0..$nchar-1;
# determine indices which I want to keep
my #indices;
for my $i (0..$#transposed){
my #a = #{$transposed[$i]};
# do not keep index if column consists of X and Y
if ( scalar(grep {/X|Y/} #a) < scalar(#a) ) {
push #indices, $i;
}
}
# only keep letters with indices
for my $k (keys %h){
my $str = $h{$k};
my $reduced = join "", map{ substr ($str, $_, 1) } #indices;
$h{$k} = $reduced;
}
This is a terrible amount of code for such a simple operation. How could I do this more elegantly (preferably not with some matrix library, but with standard perl)?
Edit
Here another example: From the following strings, the first and last characters should be removed, because in both strings, the first and last position is either X or Y:
$VAR1 = {
'1' => 'Xsome_strX',
'2' => 'YsomeXstrY'
};
Desired result:
$VAR1 = {
'1' => 'some_str',
'2' => 'someXstr'
};

my $total = values %hash;
my %ind;
for my $v (values %hash) {
$ind{ pos($v) -1 }++ while $v =~ /[XY]/g;
}
my #to_remove = sort {$b <=> $a} grep { $ind{$_} == $total } keys %ind;
for my $v (values %hash) {
substr($v, $_, 1, "") for #to_remove;
}

Related

Match key elements in same hash with regex and store it in HoA

I have list of elements in a hash (%Hash).
I need to compare key elements each other and if one key matches with another key (with certain condition), then it would become a pair and should be stored it in HOA.
Here is my script:
use strict; use warnings;
use Data::Dumper;
my (%A, %Final);
my %Hash = (
'Network=ROOT,Network=R16,Me=4462,Element=1,Node=1,Sec=1,Car=3' => 1,
'Network=ROOT,Network=R16,Me=4462,Element=1,Equipment=1,Rack=1,Slot=1,Unit=1,DeviceSet=1,Device=1' => 1
);
foreach my $car (keys %Hash){
if($car =~ m/Car=\d+/){
if($car =~ /(.*),Node/){
$A{$1} = $car;
}
}
}
print Dumper(\%A);
foreach (keys %Hash){
if($_ =~ m/$A{$_}(.*)(Device=\d+)$/){
push #{$Final{$_}}, $A{$_};
}
}
print "Final:\n".Dumper(\%Final);
Let's say in example, if any element contains Network=ROOT,Network=R16,Me=4462,Element=1 then if any other element contains above data as well as key ends with Device=\d+ then the both should become a key and array of values.
Current result:
$VAR1 = {
'Network=ROOT,Network=R16,Me=4462,Element=1,Equipment=1,Rack=1,Slot=1,Unit=1,DeviceSet=1,Device=1' => [
undef
]
};
Expected Result:
$VAR1 = {
'Network=ROOT,Network=R16,Me=4462,Element=1,Equipment=1,Rack=1,Slot=1,Unit=1,DeviceSet=1,Device=1' => [
Network=ROOT,Network=R16,Me=4462,Element=1,Node=1,Sec=1,Car=3
]
};
Why I am getting undef value in the HOA.
Update:
I have updated the code, in the second hash iteration the code looks like this:
foreach my $ele (keys %Hash){
foreach my $s_c(keys %A){
if($ele =~ m/$s_c(.*)(Device=\d+)$/){
push #{$Final{$ele}}, $A{$s_c};
}
}
}
I am able to get the desired result now. But is there any compact way to achieve this (since I am iterating hash within a hash).
You can use maps to shorten your code. Here is the solution.
#!/usr/bin/perl
use strict; use warnings;
use Data::Dumper;
my (%A, %Final);
my %Hash = (
'Network=ROOT,Network=R16,Me=4462,Element=1,Node=1,Sec=1,Car=3' => 1,
'Network=ROOT,Network=R16,Me=4462,Element=1,Equipment=1,Rack=1,Slot=1,Unit=1,DeviceSet=1,Device=1' => 1
);
map {
$A{$1} = $_ if $_ =~ m/Car=\d+/ && $_ =~ /(.*),Node/
} keys %Hash;
print Dumper(\%A);
map {
my $ele = $_;
map {
my $s_c = $_;
push #{$Final{$ele}}, $A{$s_c} if $ele =~ m/$s_c.*Device=\d+$/;
} keys %A;
} keys %Hash;
print "Final:\n".Dumper(\%Final);

Perl - Using regex to match input in hash key or value

First, this is a homework assignment. I am having a tough time with regex, and I'm stuck.
This is the code I have so far, where I have the user designate a filename, and if it exists, populates a hash of the names as keys, and the phone numbers as the values.
#!/usr/bin/perl
use strict;
print "\nEnter Filename: ";
my $file = <STDIN>;
chomp $file;
if(!open(my $fileName, "<", "$file"))
{
print "Sorry, that file doesn't exist!", "\n";
}
else
{
my %phoneNums;
while (my $line=<$fileName>)
{
chomp($line);
(my $name,my $number) = split /:/, $line;
$phoneNums{$name} = $number;
}
print "Read in the file!", "\n\n";
print "Enter search: ";
my $input = <STDIN>;
chomp $input;
#HERE IS WHERE I'M LOST
}
print "\n";
This is the part I am stuck on:
Allow the user to enter a search string.
Look for matches using the same style as the phone. Any individual
character in the search string can match any other character from the
key, meaning a ‘2’ in the search string can match a ‘2’, ‘A’, ‘B’, or ‘C’ in the contact list. Matches can occur in the contact name or the phone number. For a match to occur, each character in the search string must appear, in order, in the contact info, but not necessarily next to each
other. For example, a search string of “86” (essentially the same as a search string of “TM” or “NU”) would match “TOM” but not “MOTHER”.
Characters on each phone keys:
0,
1,
2ABC,
3DEF,
4GHI,
5JKL,
6MNO,
7PQRS,
8TUV,
9WXYZ
I just am stuck on how exactly to make all those character classes, and any help at all is much appreciated.
The way to tackle this is by writing a function that reduces your 'things' to their common components. The best way to do this IMO is use a hash:
my %num_to_letter = (
0 => [],
1 => [],
2 => [ "A", "B", "C" ],
3 => [ "D", "E", "F" ],
4 => [ "G", "H", "I" ],
5 => [ "J", "K", "L" ],
## etc.
);
my %letter_to_num;
foreach my $key ( keys %num_to_letter ) {
foreach my $element ( #{$num_to_letter{$key}} ) {
$letter_to_num{lc($element)} = lc($key);
}
}
print Dumper \%letter_to_num;
This creates a map of which letters or numbers map to their original - a bit like this:
$VAR1 = {
'b' => '2',
'g' => '4',
'e' => '3',
'i' => '4',
'a' => '2',
'j' => '5',
...
Note - you can do this by hand, but I prefer to generate from the top map, because I think it looks neater. Note - we use lc to lower case everything, so this becomes case insensitive. It's probably worth looking at fc - which is a similar tool but handles international characters. (Not relevant in this example though)
You then 'reduce' both search and 'target' to their common values:
sub normalise {
my ( $input ) = #_;
#join with no delimiter.
return join ( '',
#look up $_ (each letter) in $letter_to_num
#if not present, use // operator to return original value.
#this means we get to turn numbers into letters,
#but leave things that are already numbers untouched.
map { $letter_to_num{lc($_)} // $_ }
#split the input line into characters.
split ( //, $input )
);
}
print normalise ("DAD"),"\n"; ## 323
And then compare one against the other:
my $search = "DAD";
my $normalised_search = normalise($search);
print "Searching for: \"$normalised_search\"\n";
my $number_to_match = '00533932388';
my $string_to_match = "daddyo";
print "Matches number\n"
if normalise($number_to_match) =~ m/$normalised_search/;
print "Matches string\n"
if normalise($string_to_match) =~ m/$normalised_search/;
Here's an almost procedural approach that cheats a bit by using Hash::MultiValue:
use Hash::MultiValue; # makes reversing and flattening easier
# build a hash from the phone_keypad array or do it manually!
my #phone_keypad = qw(0 1 2ABC 3DEF 4GHI 5JKL 6MNO 7PQRS 8TUV 9WXYZ);
my %num2let = map { /(\d{1})(\w{3,4})/;
if ($2) { $1 => [ split('',$2) ] } else { 0 => [] , 1 => [] }
} #phone_keypad ;
# Invert the hash using Hash::MultiValue
my $num2let_mv = Hash::MultiValue->from_mixed(\%num2let);
my %let2num = reverse $num2let_mv->flatten ;
# TOM in numbers - 866 in letters
my $letters = "TOM" ;
print join '', $let2num{$_} // $_ for (split('', $letters)), "\n";
my $phone_input = "866" ;
print join '', #{$num2let{$_}}," " for (split('', $phone_input)) , "\n";
Output:
866
TUV MNO MNO
So here "TOM" would overlap with "UNO" ... I like #Sobrique's answer :-)
To search an array/list of contact names using the phone keypad input we can create a hash containing keys and values of the names and their number equivalents and then match the "converted" name value against the input:
use Hash::MultiValue; # makes reversing and flattening easier
my #contacts = <DATA> ;
chomp #contacts;
# build a hash from the phone_keypad array or do it manually!
my #phone_keypad = qw(0 1 2ABC 3DEF 4GHI 5JKL 6MNO 7PQRS 8TUV 9WXYZ);
my %num2let = map { /(\d{1})(\w{3,4})/;
if ($2) { $1 => [ split('',$2) ] } else { 0 => [] , 1 => [] }
} #phone_keypad ;
# Invert the hash using Hasj::MultiValue
my $num2let_mv = Hash::MultiValue->from_mixed(\%num2let);
my %let2num = reverse $num2let_mv->flatten ;
# create key/value pairs for contact database
my %contacts2nums ;
for $contact (#contacts) {
$contacts2nums{$contact} = join "",
map { $let2num{$_} } split('', uc $contact);
}
my $phone_input = "866";
for my $contact (keys %contacts2nums) {
print "The text: \"$contact\" matches the input: \"$phone_input\" \n"
if $phone_input eq $contacts2nums{$contact};
}
__DATA__
Tom
Mother
TIMTOWDI
DAD
Gordon
Output:
The text: "Tom" matches the input: "866"
A more organized approach would wrap the conversion operation in a function.
Addendum:
With a real keypad you could probably come up with a simple algorithm that could be more deterministic regarding the letter you want to associate with the number on the keypad. You could iterate through the array based on number of presses of the key: e.g. two presses on "2" would be equal to "B", etc. You'd just have to figure out how/when to move to the next character with some kind of timeout/wait value. This way you would have a more exact string on which to base your search.

Counting repeated characters around nth character of string

For a part of my University project, i am trying to count base repeats around the 11th character of 21 bp sequences of DNA. I want to look at the 11th character, then if there are repeated identical characters around it, to print them.
For example:
GCTAAAGTAAAAGAAGATGCA
Would give results of:
11th base is A, YES repeated 4 times
I really don't know how to go about this, to get the 11th character i'm sure i can use a regex but after that i'm not sure.
To start with I have playing around using a hash and looking for the number of occurrences of different nucleotide groups in each sequence, as follows:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/hash.txt";
open FILE1, "/Users/edwardtickle/Documents/randomoutput.txt";
open( OUTPUTFILE, ">$outputfile" );
while (<FILE1>) {
if (/^(\S+)/) {
my $dna = $1;
my #repeats = ( $dna =~ /[A]{3}/g );
my #changes = ( $dna =~ /[T]{2}/g );
my %hash = ();
my %hash1 = ();
for my $repeats (#repeats) {
$hash{$repeats}++;
}
for my $changes (#changes) {
$hash1{$changes}++;
}
for my $key ( keys %hash ) {
print OUTPUTFILE $key . " => " . $hash{$key} . "\n";
}
for my $key1 ( keys %hash1 ) {
print OUTPUTFILE $key1 . " => " . $hash1{$key1} . "\n";
}
}
}
FILE1 data:
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
Gives results of:
TT => 2
AAA => 1
TT => 4
AAA => 1
TT => 2
TT => 4
AAA => 1
TT => 2
TT => 5
AAA => 1
TT => 1
AAA => 2
TT => 1
TT => 2
When for this sample data set i would like a cumulative tally of every sequence, rather than number of individual occurrences in each matching string, like this:
AAA => 6
TT => 23
How do i go about changing the output? And how do i prevent a string of TTTTT bases showing up as TT => 2? Then if anyone has any recommendations of how to go about the original problem/if it is even possible, that would be greatly appreciated.
Thanks in advance!
Using a regular expression:
use strict;
use warnings;
my $char = 11; # Looking for the 11th character, or position 10.
while (<DATA>) {
chomp;
if (m{
( (.) \2*+ ) # Look for a repeated character sequence
(?<= .{$char} ) # Must include pos $char - 1
}x) {
printf "%s => %d\n", $2, length($1);
}
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
Output:
G => 1
A => 3
T => 1
T => 2
A => 4
T => 3
A => 4
A => 5
T => 2
G => 3
This code should do what you need. There really isn't a regular expression that will find the longest sequence of a given character at and around a given character position. This code works by splitting the string $seq into an array of characters #seq and then searching forwards and backwards from the centre.
It's practical to do things this way because the sequence is relatively short, and as long as there's an odd numbers of characters in the string it will calculate the centre point for you.
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($base, $length) = find_mid_band($_);
printf "%s => %d\n", $base, $length;
}
sub find_mid_band {
my ($seq) = #_;
my #seq = unpack '(A1)*', $seq;
my $len_seq = #seq;
my $c_offset = ($len_seq - 1) / 2;
my $c_char = $seq[$c_offset];
my ($start, $end) = ($c_offset, $c_offset + 1);
--$start while $start > 0 and $seq[$start-1] eq $c_char;
++$end while $end < $len_seq and $seq[$end] eq $c_char;
return $c_char, $end-$start;
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
output
G => 1
A => 3
T => 1
T => 2
A => 4
T => 3
A => 4
A => 5
T => 2
G => 3
Update
Here's a better way. It's shorter and faster, and works by all the subsequences of the same character until it finds a sequence that spans the middle character.
The output is identical to that of the above.
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($base, $length) = find_mid_band($_);
printf "%s => %d\n", $base, $length;
}
sub find_mid_band {
my ($seq) = #_;
my $mid_seq = length($seq) / 2;
while ( $seq =~ /(.)\1*/g ) {
if ($-[0] < $mid_seq and $+[0] > $mid_seq) {
return $1, $+[0]-$-[0];
}
}
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC

Perl substitute strings in hash

Hi I have an perl hash of arbitrary depth. I want to substitute the a string in entire structure with something else.
What is the right approach to do it?
I did something like this
#convert the hash to string for manipulation
my $data = YAML::Dump(%hash);
# do manipulation
---
---
# time to get back the hash
%hash = YAML::Load($data);
Your idea seems very risky to me, since it can be hard to be sure that the substitution won't destroy something in the output of YAML::Dump that will prevent the result from being read back in again, or worse, something that will alter the structure of the hash as it is represented in the dump string. What if the manipulation you are trying to perform is to replace : with , or ’ with ', or something of that sort?
I would probably do something more like this:
use Scalar::Util 'reftype';
# replace $this with $that in key names and string values of $hash
# recursively apply replacement in hash and all its subhashes
sub hash_replace {
my ($hash, $this, $that) = #_;
for my $k (keys %$hash) {
# substitution in value
my $v = $hash->{$k};
if (ref $v && reftype($v) eq "HASH") {
hash_replace($v, $this, $that);
} elsif (! ref $v) {
$v =~ s/$this/$that/og;
}
my $new_hash = {};
for my $k (keys %$hash) {
# substitution in key
(my $new_key = $k) =~ s/$this/$that/og;
$new_hash->{$new_key} = $hash->{$k};
}
%$hash = %$new_hash; # replace old keys with new keys
}
The s/…/…/ replacement I used here may not be appropriate for your task; you should feel free to use something else. For example, instead of strings $this and $that you might pass two functions, $key_change and $val_change which are applied to keys and to values, respectively, returning the modified versions. See the ###### lines below:
use Scalar::Util 'reftype';
# replace $this with $that in key names and string values of $hash
# recursively apply replacement in hash and all its subhashes
sub hash_replace {
my ($hash, $key_change, $val_change) = #_;
for my $k (keys %$hash) {
# substitution in value
my $v = $hash->{$k};
if (ref $v && reftype($v) eq "HASH") {
hash_replace($v, $key_change, $val_change);
} elsif (! ref $v) {
$v = $val_change->($v); #######
}
}
my $new_hash = {};
for my $k (keys %$hash) {
# substitution in key
my $new_key = $key_change->($k); #######
$new_hash->{$new_key} = $hash->{$k};
}
%$hash = %$new_hash;
}
Here's one way to attack it, by recursing through the hash. In this code, you pass in a sub that does whatever you like to each value in the nested hash. This code only modifies the values, not the keys, and it ignores other reference types (ie. scalar refs, array refs) in the nested structure.
#!/usr/bin/perl -w
use Modern::Perl;
## Visit all nodes in a nested hash. Bare-bones.
sub visit_hash
{
my ($start, $sub) = #_;
my #q = ( $start );
while (#q) {
my $hash = pop #q;
foreach my $key ( keys %{$hash} ) {
my $ref = ref($hash->{$key});
if ( $ref eq "" ) { # not a reference
&$sub( $hash->{$key} );
next;
}
if ( $ref eq "HASH" ) { # reference to a nested hash
push #q, $hash->{$key};
next;
}
# ignore other reference types.
}
}
}
The following gives an example of how to use it, replacing e with E in a nested hash:
# Example of replacing a string in all values:
my %hash =
(
a => "fred",
b => "barney",
c => "wilma",
d => "betty",
nest1 =>
{
1 => "red",
2 => "orange",
3 => "green"
},
nest2 =>
{
x => "alpha",
y => "beta",
z => "gamma"
},
);
use YAML::XS;
print "Before:\n";
print Dump( \%hash );
# now replace 'e' with 'E' in all values.
visit_hash( \%hash, sub { $_[0] =~ s/e/E/g; } );
print "After:\n";
print Dump( \%hash );

How to automagically create pattern based on real data?

I have many vendors in database, they all differ in some aspect of their data. I'd like to make data validation rule which is based on previous data.
Example:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
Goal: if user inputs string 'XZ-217' for vendor B, algorithm should compare previous data and say: this string is not similar to vendor B previous data.
Is there some good way/tools to achieve such comparison? Answer could be some generic algoritm or Perl module.
Edit:
The "similarity" is hard to define, i agree. But i'd like to catch to algorithm, which could analyze previous ca 100 samples and then compare the outcome of analyze with new data. Similarity may based on length, on use of characters/numbers, string creation patterns, similar beginning/end/middle, having some separators in.
I feel it is not easy task, but on other hand, i think it has very wide use. So i hoped, there is already some hints.
You may want to peruse:
http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm (for instance)
Joel and I came up with similar ideas. The code below differentiates 3 types of zones.
one or more non-word characters
alphanumeric cluster
a cluster of digits
It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.
use strict;
use warnings;
use List::Util qw<max min>;
sub compile_search_expr {
shift;
#_ = #{ shift() } if #_ == 1;
my $str
= join( '|'
, map { join( ''
, grep { defined; }
map {
$_ eq 'P' ? quotemeta;
: $_ eq 'W' ? "\\w{$_->[1],$_->[2]}"
: $_ eq 'D' ? "\\d{$_->[1],$_->[2]}"
: undef
;
} #$_
)
} #_ == 1 ? #{ shift } : #_
);
return qr/^(?:$str)$/;
}
sub merge_profiles {
shift;
my ( $profile_list, $new_profile ) = #_;
my $found = 0;
PROFILE:
for my $profile ( #$profile_list ) {
my $profile_length = #$profile;
# it's not the same profile.
next PROFILE unless $profile_length == #$new_profile;
my #merged;
for ( my $i = 0; $i < $profile_length; $i++ ) {
my $old = $profile->[$i];
my $new = $new_profile->[$i];
next PROFILE unless $old->[0] eq $new->[0];
push( #merged
, [ $old->[0]
, min( $old->[1], $new->[1] )
, max( $old->[2], $new->[2] )
]);
}
#$profile = #merged;
$found = 1;
last PROFILE;
}
push #$profile_list, $new_profile unless $found;
return;
}
sub compute_info_profile {
shift;
my #profile_chunks
= map {
/\W/ ? [ P => $_ ]
: /\D/ ? [ W => length, length ]
: [ D => length, length ]
}
grep { length; } split /(\W+)/, shift
;
}
# Psuedo-Perl
sub process_input_task {
my ( $application, $input ) = #_;
my $patterns = $application->get_patterns_for_current_customer;
my $regex = $application->compile_search_expr( $patterns );
if ( $input =~ /$regex/ ) {}
elsif ( $application->approve_divergeance( $input )) {
$application->merge_profiles( $patterns, compute_info_profile( $input ));
}
else {
$application->escalate(
Incident->new( issue => INVALID_FORMAT
, input => $input
, customer => $customer
));
}
return $application->process_approved_input( $input );
}
Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.
output:
A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})
code:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw'uniq each_arrayref';
my %examples = (
A => [qw/ XZ-4 XZ-23 XZ-217 /],
B => [qw/ 1276 1899 22711 /],
C => [qw/ 12-4 12-75 12 /],
);
foreach my $example (sort keys %examples) {
print "$example: ", gen_regex(#{ $examples{$example} }) || "Generate failed!", "\n";
}
sub gen_regex {
my #cases = #_;
my %exploded;
# ex. $case may be XZ-217
foreach my $case (#cases) {
my #parts =
grep { defined and length }
split( /(\d+|\w+)/, $case );
# #parts are ( XZ, -, 217 )
foreach (#parts) {
if (/\d/) {
# 217 becomes ['\d' => 3]
push #{ $exploded{$case} }, ['\d' => length];
} elsif (/\w/) {
#XZ becomes ['\w' => 2]
push #{ $exploded{$case} }, ['\w' => length];
} else {
# - becomes ['lit' => '-']
push #{ $exploded{$case} }, ['lit' => $_ ];
}
}
}
my $pattern = '';
# iterate over nth element (part) of each case
my $ea = each_arrayref(values %exploded);
while (my #parts = $ea->()) {
# remove undefined (i.e. optional) parts
my #def_parts = grep { defined } #parts;
# check that all (defined) parts are the same type
my #part_types = uniq map {$_->[0]} #def_parts;
if (#part_types > 1) {
warn "Parts not aligned\n";
return;
}
my $type = $part_types[0]; #same so make scalar
# were there optional parts?
my $required = (#parts == #def_parts);
# keep the values of each part
# these are either a repitition or lit strings
my #values = sort uniq map { $_->[1] } #def_parts;
# these are for non-literal quantifiers
my $min = $required ? $values[0] : 0;
my $max = $values[-1];
# write the specific pattern for each type
if ($type eq '\d') {
$pattern .= '\d' . "{$min,$max}";
} elsif ($type eq '\w') {
$pattern .= '\w' . "{$min,$max}";
} elsif ($type eq 'lit') {
# quote special characters, - becomes \-
my #uniq = map { quotemeta } uniq #values;
# join with alternations, surround by non-capture grouup, add quantifier
$pattern .= '(?:' . join('|', #uniq) . ')' . ($required ? '{1}' : '?');
}
}
# build the qr regex from pattern
my $regex = qr/$pattern/;
# test that all original patterns match (#fail should be empty)
my #fail = grep { $_ !~ $regex } #cases;
if (#fail) {
warn "Some cases fail for generated pattern $regex: (#fail)\n";
return '';
} else {
return $regex;
}
}
To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.
If there was a Tie::StringApproxHash module, it would fit the bill here.
I think you're looking for something that combines the fuzzy-logic functionality of String::Approx and the hash interface of Tie::RegexpHash.
The former is more important; the latter would make light work of coding.