How do I grab an unknown number of captures from a pattern?

How do I grab an unknown number of captures from a pattern? - regex

Lets say I have a pattern:
<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2
As you can see, the pattern could have multiple values (in this case pin has 2 sets of pin names). The amount is unknown.
How would I parse this? Here is what I have so far, but it is not helpful as it does not take into account if the pattern has more than 2 sets of pins.
my $pattern = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
if ( $pattern =~ m#\<cell\> (\w*=\w*) \<pin\> (\w*=\w*) \<type\> (\w*=\w*)#) {
my $cell_name = $1;
my $pin_name = $2;
my $type_name = $3;
}
as you can see, this will only work if there is only one set of pin names. However I want it to be able to adjust to multiple unknown sets of pin names. I think I would have to construct like an array or hash, but I am not really sure what is the best way of grabbing these values taking into account the unknown multiple pin sets.
I would like to be able to store the cell_name,pin_name and type_name as an array or hash with the sets of values.

Your problem is a bit trickier than Why do I get the first capture group only? but some of those ideas may help. The trick is to stop thinking about doing everything in a single pattern.
If that's really your input, I'd be tempted to match groups of things around an =. Matching in list context, such as assigning to a hash, returns the list of matches:
use Data::Dumper;
my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
my %values = $input =~ m/ (\S+) = (\S+) /gx;
print Dumper( \%values );
The things before the = become keys and the things after become the values:
$VAR1 = {
'pin1' => 'pin2',
'type1' => 'type2',
'cell1' => 'cell2',
'pin3' => 'pin4'
};
But life probably isn't that easy. The example names probably don't really have pin, cell, and so on.
There's another thing I like to do, though, because I miss having all that fun with sscan. You can walk a string by matching part of it at a time, then on the next match, start where you left off. Here's the whole thing first:
use v5.10;
use Data::Dumper;
my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
my %hash;
while( 1 ) {
state $type;
if( $input =~ /\G < (.*?) > \s* /xgc ) {
$type = $1;
}
elsif( $input =~ /\G (\S+) = (\S+) \s* /xgc ) {
$hash{$type}{$1}{$2}++;
}
else { last }
}
print Dumper( \%hash );
And the data structure, which really doesn't matter and can be anything that you like:
$VAR1 = {
'type' => {
'type1' => {
'type2' => 1
}
},
'pin' => {
'pin1' => {
'pin2' => 1
},
'pin3' => {
'pin4' => 1
}
},
'cell' => {
'cell1' => {
'cell2' => 1
}
}
};
But let's talk about his for a moment. First, all of the matches are in scalar context since they are in the conditional parts of the if-elsif-else branches. That means they only make the next match.
However, I've anchored the start of each pattern with \G. This makes the pattern match at the beginning of the string or the position where the previous successful match left off when I use the /g flag in scalar context.
But, I want to try several patterns, so some of them are going to fail. That's where the /c flag comes in. It doesn't reset the match position on failure. That means the \G anchor won't reset on an unsuccessful match. So, I can try a pattern, and if that doesn't work, start at the same position with the next one.
So, when I encounter something in angle brackets, I remember that type. Until I match another thing in angle brackets, that's the type of thing I'm matching. Now when I match (\S+) = (\S+), I can assign the matches to the right type.
To watch this happen, you can output the remembered string position. Each scalar maintains its own cursor and pos(VAR) returns that position:
use v5.10;
use Data::Dumper;
my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
my %hash;
while( 1 ) {
state $type;
say "Starting matches at " . ( pos($input) // 0 );
if( $input =~ /\G < (.*?) > \s* /xgc ) {
$type = $1;
say "Matched <$type>, left off at " . pos($input);
}
elsif( $input =~ /\G (\S+) = (\S+) \s* /xgc ) {
$hash{$type}{$1}{$2}++;
say "Matched <$1|$2>, left off at " . pos($input);
}
else {
say "Nothing left to do, left off at " . pos($input);
last;
}
}
print Dumper( \%hash );
Before the Dumper output, you now see the global matches in scalar context walk the string:
Starting matches at 0
Matched <cell>, left off at 7
Starting matches at 7
Matched <cell1|cell2>, left off at 19
Starting matches at 19
Matched <pin>, left off at 25
Starting matches at 25
Matched <pin1|pin2>, left off at 35
Starting matches at 35
Matched <pin3|pin4>, left off at 45
Starting matches at 45
Matched <type>, left off at 52
Starting matches at 52
Matched <type1|type2>, left off at 63
Starting matches at 63
Nothing left to do, left off at 63
Finally, as a bonus, here's a recursive decent grammar that does it. It's certainly overkill for what you've provided, but does better in more tricky situations. I won't explain it other than to say it produces the same data structure:
use v5.10;
use Parse::RecDescent;
use Data::Dumper;
my $grammar = <<~'HERE';
startrule: context_pairlist(s)
context_pairlist: context /\s*/ pair(s)
context: '<' /[^>]+/ '>'
{ $::context = $item[2] }
pair: /[A-Za-z0-9]+/ '=' /[A-Za-z0-9]+/
{ main::build_hash( $::context, #item[1,3] ) }
HERE
my $parser = Parse::RecDescent->new( $grammar );
my %hash;
sub build_hash {
my( $context, $name, $value ) = #_;
$hash{$context}{$name}{$value}++;
}
my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
$parser->startrule( $input );
say Dumper( \%hash );

You have space separated tokens. Some tokens indicate a new scope and some tokens indicate values being set in that scope. I find it most straightforward to just run through the list of tokens in this case:
#!/usr/bin/env perl
use feature 'say';
use strict;
use warnings;
my $s = q{<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2};
my (%h, $k);
while ($s =~ /(\S+)/g) {
my ($x, $y)= split /=/, $1;
if (defined $y) {
push $h{$k}->#*, {key => $x, value => $y};
next;
}
$h{$k = $x} = [];
}
use Data::Dumper;
print Dumper \%h;
Note that this method considers everything that is not an assignment to be a scope marker.
The resulting data structure is suitable for feeding into something else. Using {key => $key, 'value' => $value} instead of {$key => $value} allows 1) straightforward handling of assignments in a scope upstream; and 2) actually allows the same identifier to be assigned multiple times in a scope (giving you an opportunity to detect this if so desired):
$VAR1 = {
'<cell>' => [
{
'value' => 'cell2',
'key' => 'cell1'
}
],
'<pin>' => [
{
'value' => 'pin2',
'key' => 'pin1'
},
{
'value' => 'pin4',
'key' => 'pin3'
}
],
'<type>' => [
{
'key' => 'type1',
'value' => 'type2'
}
]
};

Another approach is to split the $pattern into an array where each tag starts a new row. This makes it easier to extract the relevant data as this example shows:
#!/usr/bin/perl
$pattern="<cell> cell1=1234567890 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
%cell=%pin=%type=();
print "Original pattern =$pattern\n";
($pattern_split=$pattern) =~ s/</\n</g;
#array=split(/\n/, $pattern_split);
# Extract relevant data (NOTE: finetune regex here) and store them in appropriate hashes indexed by $cnum (cellphone number)
for $_ (#array) {
/<cell>\s*\w+=(\w+)/ && do { $cnum = $1; $cell{$cnum} = $cnum };
/<pin>\s*(.+?)\s*$/ && do { $pin_list=$1; $pin{$cnum} = $pin_list };
/<type>\s*\w+=(\w+)/ && do { $type{$cnum} = $1 };
}
$cn="1234567890";
print "Result: Cellnumber '$cell{$cn}' has pin_list='$pin{$cn}' and type='$type{$cn}'\n";
Prints:
Original pattern =<cell> cell1=1234567890 <pin> pin1=pin2 pin3=pin4 <type> type1=type2
Result: Cellnumber '1234567890' has pin_list='pin1=pin2 pin3=pin4' and type='type2'

Related

Dynamically capture regular expression match in Perl

I'm trying to dynamically catch regex matching in Perl. I've known that eval will help me do this but I may be doing something wrong.
Code:
use strict;
use warnings;
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
my $str = '1/12/2016';
foreach my $pattern (keys (%testHash)) {
my $value = $testHash{$pattern};
my $result;
eval {
local $_ = $str;
/$pattern/;
print "\$1 - $1\n";
print "\$2 - $2\n";
print "\$3 - $3\n";
eval { print "$value\n"; }
}
}
Is it also possible to store captured regex patterns in an array?

I believe what you really want is a dynamic version of the following:
say $str =~ s/(\d+)\/(\d+)\/(\d+)/$1$2$3/gr;
String::Substitution provides what we need to achieve that.
use String::Substitution qw( gsub_copy );
for my $pattern (keys(%testHash)) {
my $replacement = $testHash{$pattern};
say gsub_copy($str, $pattern, $replacement);
}
Note that $replacement can also be a callback. This permits far more complicated substitutions. For example, if you wanted to convert 1/12/2016 into 2016-01-12, you could use the following:
'(\d+)/(\d+)/(\d+)' => sub { sprintf "%d-%02d-%02d", #_[3,1,2] },
To answer your actual question:
use String::Substitution qw( interpolate_match_vars last_match_vars );
for my $pattern (keys(%testHash)) {
my $template = $testHash{$pattern};
$str =~ $pattern # Or /$pattern/ if you prefer
or die("No match!\n");
say interpolate_match_vars($template, last_match_vars());
}

I am not completely sure what you want to do here, but I don't think your program does what you think it does.
You are useing eval with a BLOCK of code. That's like a try block. If it dies inside of that eval block, it will catch that error. It will not run your string like it was code. You need a string eval for that.
Instead of explaining that, here's an alternative.
This program uses sprintf and numbers the parameters. The %1$s syntax in the pattern says _take the first argument (1$) and format it as a string (%s). You don't need to localize or assign to $_ to do a match. The =~ operator does that on other variables for you. I also use qr{} to create a quoted regular expression (essentially a variable containing a precompiled pattern) that I can use directly. Because of the {} as delimiter, I don't need to escape the slashes.
use strict;
use warnings;
use feature 'say'; # like print ..., "\n"
my %testHash = (
qr{(\d+)/(\d+)/(\d+)} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d+) nomatch} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d\d\d\d)} => '%3$4d-%2$02d-%1$02d',
qr{\d} => '%s', # no capture group
);
my $str = '1/12/2016';
foreach my $pattern ( keys %testHash ) {
my #captures = ( $str =~ $pattern );
say "pattern: $pattern";
if ($#+ == 0) {
say " no capture groups";
next;
}
unless (#captures) {
say " no match";
next;
}
# debug-output
for my $i ( 1 .. $#- ) {
say sprintf " \$%d - %s", $i, $captures[ $i - 1 ];
}
say sprintf $testHash{$pattern}, #captures;
}
I included four examples:
The first pattern is the one you had. It uses %1$s and so on as explained above.
The second one does not match. We check the number of elements in #captured by looking at it in scalar context.
The third one shows that you can also reorder the result, or even use the sprintf formatting.
The last one has no capture group. We check by looking at the index of the last element ($# as the sigil for arrays that usually have an # sigil) in #+, which holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. The first element is the end of the overall match, so if this only has one element, we don't have capture groups.
The output for me is this:
pattern: (?^:(\d+)/(\d+)/(\d\d\d\d))
$1 - 1
$2 - 12
$3 - 2016
2016-12-01
pattern: (?^:(\d+)/(\d+)/(\d+) nomatch)
no match
pattern: (?^:\d)
no capture groups
pattern: (?^:(\d+)/(\d+)/(\d+))
$1 - 1
$2 - 12
$3 - 2016
1.12.2016
Note that the order in the output is mixed up. That's because hashes are not ordered in Perl, and if you iterate over the keys in a hash without sort the order is random.

Apologies! I realized both my question and sample code were both vague. But after reading your suggestions I came of with the following code.
I haven't optimized this code yet and there is a limit to the replacement.
foreach my $key (keys %testHash) {
if ( $str =~ $key ) {
my #matchArr = ($str =~ $key); # Capture all matches
# Search and replace (limited from $1 to $9)
for ( my $i = 0; $i < #matchArr; $i++ ) {
my $num = $i+1;
$testHash{$key} =~ s/\$$num/$matchArr[$i]/;
}
$result = $testHash{$key};
last;
}
}
print "$result\n";

Evaluing the regexp in list context returns the matches. so in your example:
use Data::Dumper; # so we can see the result
foreach my $pattern (keys (%testHash)) {
my #a = ($str =~/$pattern/);
print Dumper(\#a);
}
would do the job.
HTH
Georg

Is it also possible to store captured regex patterns in an array?
Of course it is possible to store captured substrings in an array:
#!/usr/bin/env perl
use strict;
use warnings;
my #patterns = map qr{$_}, qw{
(\d+)/(\d+)/(\d+)
};
my $str = '1/12/2016';
foreach my $pattern ( #patterns ) {
my #captured = ($str =~ $pattern)
or next;
print "'$_'\n" for #captured;
}
Output:
'1'
'12'
'2016'
I do not quite understand what you are trying to do with combinations of local, eval EXPR and eval BLOCK in your code and the purpose of the following hash:
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
If you are trying to codify that this pattern should result in three captures, you can do that like this:
my #tests = (
{
pattern => qr{(\d+)/(\d+)/(\d+)},
ncaptures => 3,
}
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern})
or next;
unless (#captured == $test->{ncaptures}) {
# handle failure
}
}
See this answer to find out how you can automate counting the number of capture groups in a pattern. Using the technique in that answer:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my #tests = map +{ pattern => qr{$_}, ncaptures => number_of_capturing_groups($_) }, qw(
(\d+)/(\d+)/(\d+)
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern});
ok #captured == $test->{ncaptures};
}
done_testing;
sub number_of_capturing_groups {
"" =~ /|$_[0]/;
return $#+;
}
Output:
ok 1
1..1

How can I match these function calls, and extract the nmae of the function and the first argument?

I am trying to parse an array of elements. Those who match a pattern like the following:
Jim("jjanson", Customer.SALES);
I want to create a hash table like Jim => "jjanson"
How can I do this?
I can not match the lines using:
if($line =~ /\s*[A-Za-z]*"(.*),Customer.*\s*/)

You're not matching either the '(' after the name, nor the ' ' after the comma, before "Customer.".
I can get 'jjanson"' using this expression:
/\s*[A-Za-z]\(*"(.*), Customer.*\s*/
But I assume you don't want jjanson", so we need to modify it like so. (I tend to include the negative character class when I'm looking for simply-delimited stuff. So, in this case I'll make the expression "[^"]*"
/\s*[A-Za-z]\(*"([^"]+)", Customer.*\s*/
Also, I try not to depend upon whitespace, presence or number, I'm going to replace the space with \s*. That you didn't notice that you skipped the whitespace is a good illustration of the need to say "ignore a bunch of whitespace".
/\s*[A-Za-z]\(*"([^"]+)",\s*Customer.*\s*/
Now it's only looking for the sequence ',' + 'Customer' in the significant characters. Functionally, the same, if more flexible.
But since you only do one capture, I can't see what you'd map to what. So I'll do my own mapping:
my %records;
while ( my $line = $source->()) { # simply feed for a source of lines.
my ( $first, $user, $tag )
= $line = m/\s*(\p{Alpha}+)\s*\(\s*"([^"]+)",\s*Customer\.(\S+?)\)\/
;
$records{ $user }
= { first => $first
, username => $user
, tag => $tag
};
}
This is much more than you would tend to need in a one-off, quick solution. But I like to store as much of my input as seems significant.

Note that Jim("jjanson", Customer.SALES); matches the syntax of a function call with two arguments. You can thus abuse string eval:
#!/usr/bin/env perl
use strict;
use warnings;
use YAML::XS;
my $info = extract_first_arg(q{ Jim("jjanson", Customer.SALES);} );
print Dump $info;
sub extract_first_arg {
my $call = shift;
my ($name) = ($call =~ m{ \A \s* (\w+) }x);
unless ($name) {
warn "Failed to find function name in '$call'";
return;
}
my $username = eval sprintf(q{
package My::DangerZone;
no strict;
local *{ %s } = sub { $_[0] };
%s
}, $name, $call);
return { $name => $username };
}
Output:
---
Jim: jjanson
Or, you can abuse autoloading:
our $AUTOLOAD;
print Dump eval 'no strict;' . q{ Jim("jjanson", Customer.SALES); };
sub AUTOLOAD {
my ($fn) = ($AUTOLOAD =~ /::(\w+)\z/);
return { $fn => $_[0] };
}
I would not necessarily recommend using these methods, especially on input that is not in your control, and in a situation where this script has access to sensitive facilities.
On the other hand, I have, in the right circumstances, utilized this kind of thing in transforming one given set of information into something that can be used elsewhere.

Try this:
$line = 'Jim("jjanson", Customer.SALES)';
my %hashStore = (); #Jim("jjanson"
if($line=~m/^\s*([^\(\)]*)\(\"([^\"]*)\"/g) { $hashStore{$1} = $2; }
use Data::Dumper;
print Dumper \%hashStore;
Output:
$VAR1 = {
'Jim' => 'jjanson'
};

Perl Regex with eval and late expression doesn't seem to execute twice (unless EVAL'd)

Note: Follow-Up posted as response to this question below.
I have a subroutine that uses regular expressions to parse a string into a series of chunks, and pass those chunks back in a list. The string is packed in the format of one 32 bit integer in network order signifying the length of the item, followed by the item, repeated an unknown number of times. If one were creating this string from a list #samples, it might look like this:
my $string = pack 'NA*' x scalar #samples, map { length($_), $_ } #samples;
The basic parsing routine using Regex looks about like this:
my $parser = qr/
^ (?:
(.{4}) (?{ ($chunklen) = unpack( 'N', $^N ); })
((??{ qr~.{$chunklen}~s })) (?{ push #chunks, $^N; })
)+ $
/xs;
The first time it is invoked, it works flawlessly. The second time, and all subsequent times, it does not work at all. This example, with various points of debug output:
use strict;
use Time::HiRes qw/ time /;
use Data::Dumper;
use v5.10;
for ( 1 .. 2 ) {
print "Try $_\n";
my #samples = ( time, rand(10000), time );
my $string = pack 'NA*' x scalar #samples, map { length($_), $_ } #samples;
print Dumper parse $string;
print "\n";
}
sub parse($) {
my ( $text ) = #_;
my (#chunks, $chunklen);
my $parser = qr/
^ (?:
(.{4}) (?{ ($chunklen) = unpack( 'N', $^N ); say $chunklen })
((??{ qr~.{$chunklen}~s })) (?{ push #chunks, $^N; say $^N })
)+ $
/xs;
unless ( $text =~ $parser ) {
die 'bad parse';
}
print Dumper \#chunks;
return #chunks;
}
produces the following output
Try 1
16
1425057728.71843
16
1491.39404296875
16
1425057728.71843
$VAR1 = [
'1425057728.71843',
'1491.39404296875',
'1425057728.71843'
];
$VAR1 = '1425057728.71843';
$VAR2 = '1491.39404296875';
$VAR3 = '1425057728.71843';
Try 2
16
1425057728.71903
16
2074.27978515625
16
1425057728.71903
$VAR1 = [];
Note that #chunks somehow is not populated the second time, even though the values were found appropriately.
Why don't it run more than once?
Things I have tried that did not work:
Storing $parser as a string, and compiling when it was tested
Returning the lazy expression (??{ ... }) as a string and having it compiled later
Removing everything from a subroutine and calling it inline in the loop
The one thing I tried that did allow it to parse more than once was putting the whole thing in a quoted eval, like so:
sub parse($) { return eval q!
my ( $text ) = #_;
my (#chunks, $chunklen);
my $parser = qr/
^ (?:
(.{4}) (?{ ($chunklen) = unpack( 'N', $^N ) })
((??{ qr~.{$chunklen}~s })) (?{ push #chunks, $^N })
)+ $
/xs;
unless ( $text =~ $parser ) {
die 'bad parse';
}
return #chunks; !;
}
If we run that through the same testing body, with all the debug output removed, we see the following:
Try 1
$VAR1 = '1425058001.056';
$VAR2 = '7401.7333984375';
$VAR3 = '1425058001.056';
Try 2
$VAR1 = '1425058001.0567';
$VAR2 = '1740.1123046875';
$VAR3 = '1425058001.0567';
Note: This is a sample problem. I realize that there are non-regex based solutions to this problem, and that this problem is fairly trivial, but I hope to gain insight to apply with respect to much more complex parsers implemented similarly in Perl's Regex engine.

Inside of (?{ }) and (??{ }), don't use lexical vars declared outside the blocks. Use
local our #chunks;
local our $chunklen;
Welcome to the world of closures.
sub make_closure {
my $counter = 0;
return sub { return ++$counter };
}
my $counter1 = make_closure();
my $counter2 = make_closure();
say $counter1->(); # 1
say $counter1->(); # 2
say $counter1->(); # 3
say $counter2->(); # 1
say $counter2->(); # 2
say $counter1->(); # 4
sub { } captures lexical variables that are in scope, giving the sub access to them even when the scope in which they exist is gone.
You use this ability every day without knowing it.
my $foo = ...;
sub print_foo { print "$foo\n"; }
If subs didn't capture, and if they didn't capture at compile-time, the above wouldn't work in a module. The module's sub is likely to be called long after the module finishes executing (longer after use is complete), but that's when $foo goes away.
You get into trouble when you start placing named subs into other subs.
sub outer {
my ($x) = #_;
sub inner {
print("$x\n");
}
inner();
}
outer(5); # 5
outer(6); # 5!!!
inner captures $x that existed at compile-time. The first call to outer will reuse that lexical, but subsequent calls to outer will get a fresh $x. Perl warns Variable "$x" will not stay shared.
Similarly, (?{ }) and (??{ }) capture lexicals when they compiled. Since your patterns don't interpolate, they are compiled when the surrounding Perl code is compiled, and capture the lexicals that existed at that time. The first call to the sub will reuse those lexicals, but subsequent calls to the sub will get fresh lexicals.
Had you been listening, Perl would have warned you about the problem. Always use use strict; use warnings;.
Variable "$chunklen" will not stay shared at (re_eval 2) line 1.
Variable "$chunklen" will not stay shared at (re_eval 3) line 1.
Variable "#chunks" will not stay shared at (re_eval 4) line 1.
To solve the inner/outer sub problem, you'd replace what is effectively
BEGIN { *inner = sub { ... } }
with
local *inner = sub { ... };
to capture the lexical at run-time. In your case, you can't do that, so you switch to using package variables. Package variables aren't captured but looked up at run-time.
As an aside, all you need is
my $text = pack '(N/A)*', #samples;
and
my #samples = unpack '(N/A)*', $text;
For example,
$ perl -E'
my #samples = ( "abc", "defg", "hij" );
my $text = pack "(N/A)*", #samples;
say uc unpack "H*", $text;
#samples = unpack "(N/A)*", $text;
say for #samples;
'
0000000361626300000004646566670000000368696A
abc
defg
hij

Many thanks to #ikegami, whose answer will remain accepted because it gets at the heart of the matter. For this example, the solution was as simple as switching from my to state, and resetting the value. As #ikegami explained, when the regular expression was compiled, it captured, as a closure, the lexical variables required for it to function, i.e. that first call's values for #chunks and $chunklen. When it was called subsequent times, the regular expression ignored the new lexical variables declared with my and continued updating the variables defined in the first call.
As such, switching to state preserves those variables across all calls to the subroutine, so the regular expression is referring to the same values as the calling function. Then, all that was necessary was to reset the values on each call (otherwise, #chunks would contain not only chunks found this round, but also, all chunks found on previous calls).
use strict;
use warnings;
use Time::HiRes qw/ time /;
use Data::Dumper;
use v5.10;
sub parse($) {
my ( $text ) = #_;
# vvvv there vvvv
state ($chunklen, #chunks);
$chunklen = 0; # <-- here
#chunks = ();
# ^^^^ there ^^^^
my $parser = qr/
^ (?:
(.{4}) (?{ ($chunklen) = unpack( 'N', $^N ) })
((??{ qr~.{$chunklen}~s })) (?{ push #chunks, $^N })
)+ $
/xs;
unless ( $text =~ $parser ) {
die 'bad parse';
}
return #chunks;
}
for ( 1 .. 2 ) {
print "Try $_\n";
my #samples = ( time, rand(10000), time );
my $string = pack 'NA*' x scalar #samples, map { length($_), $_ } #samples;
print Dumper parse $string;
print "\n";
}
Produces
Try 1
$VAR1 = '1425063869.37065';
$VAR2 = '5458.984375';
$VAR3 = '1425063869.37065';
Try 2
$VAR1 = '1425063869.37124';
$VAR2 = '9147.03369140625';
$VAR3 = '1425063869.37124';
Edit: Note the importance of "use v5.10" in this example

Perl all matches of a regexp in a given string

Perl's regexp matching is left-greedy, so that the regexp
/\A (a+) (.+) \z/x
matching the string 'aaab', will set $1='aaa' and $2='b'.
(The \A and \z are just to force start and end of the string.)
You can also give non-greedy qualifiers, as
/\A (a+?) (.+?) \z/x
This will still match, but give $1='a' and $2='aab'.
But I would like to check all possible ways to generate the string, which are
$1='aaa' $2='b'
$1='aa' $2='ab'
$1='a' $2='aab'
The first way corresponds to the default left-greedy behaviour, and the third way corresponds to making the first match non-greedy, but there may be ways in between those extremes. Is there a regexp engine (whether Perl's, or some other such as PCRE or RE2) which can be made to try all possible ways that the regexp specified generates the given string?
Among other things, this would let you implement 'POSIX-compatible' regexp matching where the longest total match is picked. In my case I really would like to see every possibility.
(One way would be to munge the regexp itself, replacing the + modifier with {1,1} on the first attempt, then {1,2}, {1,3} and so on - for each combination of + and * modifiers in the regexp. That is very laborious and slow, and it's not obvious when to stop. I hope for something smarter.)
Background
To answer Jim G.'s question on what problem this might solve, consider a rule-based translation system between two languages, given by the rules
translate(any string of one or more 'a' . y) = 'M' . translate(y)
translate('ab') = 'U'
Then there is a possible result of translate('aaab'), namely 'MU'.
You might try to put these rules into Perl code based on regexps, as
our #m;
my #rules = (
[ qr/\A (a+) (.*) \z/x => sub { 'M' . translate($m[1]) } ],
[ qr/\A ab \z/x => sub { 'U' } ],
);
where translate runs over each of #rules and tries to apply them in turn:
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
$in =~ $lhs or next;
local #m = ($1, $2);
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
return 'fail';
}
However, calling translate('aaab') returns 'fail'. This is because
it tries to apply the first rule matching (a+)(.*) and the regexp
engine finds the match with the longest possible string of 'a'.
Using the answer suggested by ikegami, we can try all ways in which
the regular expression generates the string:
use re 'eval';
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
local our #matches;
$in =~ /$lhs (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;
foreach (#matches) {
local #m = #$_;
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
}
return 'fail';
}
Now translate('aaab') returns 'MU'.

local our #matches;
'aaab' =~ /^ (a+) (.+) \z (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;

How to automagically create pattern based on real data?

I have many vendors in database, they all differ in some aspect of their data. I'd like to make data validation rule which is based on previous data.
Example:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
Goal: if user inputs string 'XZ-217' for vendor B, algorithm should compare previous data and say: this string is not similar to vendor B previous data.
Is there some good way/tools to achieve such comparison? Answer could be some generic algoritm or Perl module.
Edit:
The "similarity" is hard to define, i agree. But i'd like to catch to algorithm, which could analyze previous ca 100 samples and then compare the outcome of analyze with new data. Similarity may based on length, on use of characters/numbers, string creation patterns, similar beginning/end/middle, having some separators in.
I feel it is not easy task, but on other hand, i think it has very wide use. So i hoped, there is already some hints.

You may want to peruse:
http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm (for instance)

Joel and I came up with similar ideas. The code below differentiates 3 types of zones.
one or more non-word characters
alphanumeric cluster
a cluster of digits
It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.
use strict;
use warnings;
use List::Util qw<max min>;
sub compile_search_expr {
shift;
#_ = #{ shift() } if #_ == 1;
my $str
= join( '|'
, map { join( ''
, grep { defined; }
map {
$_ eq 'P' ? quotemeta;
: $_ eq 'W' ? "\\w{$_->[1],$_->[2]}"
: $_ eq 'D' ? "\\d{$_->[1],$_->[2]}"
: undef
;
} #$_
)
} #_ == 1 ? #{ shift } : #_
);
return qr/^(?:$str)$/;
}
sub merge_profiles {
shift;
my ( $profile_list, $new_profile ) = #_;
my $found = 0;
PROFILE:
for my $profile ( #$profile_list ) {
my $profile_length = #$profile;
# it's not the same profile.
next PROFILE unless $profile_length == #$new_profile;
my #merged;
for ( my $i = 0; $i < $profile_length; $i++ ) {
my $old = $profile->[$i];
my $new = $new_profile->[$i];
next PROFILE unless $old->[0] eq $new->[0];
push( #merged
, [ $old->[0]
, min( $old->[1], $new->[1] )
, max( $old->[2], $new->[2] )
]);
}
#$profile = #merged;
$found = 1;
last PROFILE;
}
push #$profile_list, $new_profile unless $found;
return;
}
sub compute_info_profile {
shift;
my #profile_chunks
= map {
/\W/ ? [ P => $_ ]
: /\D/ ? [ W => length, length ]
: [ D => length, length ]
}
grep { length; } split /(\W+)/, shift
;
}
# Psuedo-Perl
sub process_input_task {
my ( $application, $input ) = #_;
my $patterns = $application->get_patterns_for_current_customer;
my $regex = $application->compile_search_expr( $patterns );
if ( $input =~ /$regex/ ) {}
elsif ( $application->approve_divergeance( $input )) {
$application->merge_profiles( $patterns, compute_info_profile( $input ));
}
else {
$application->escalate(
Incident->new( issue => INVALID_FORMAT
, input => $input
, customer => $customer
));
}
return $application->process_approved_input( $input );
}

Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.
output:
A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})
code:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw'uniq each_arrayref';
my %examples = (
A => [qw/ XZ-4 XZ-23 XZ-217 /],
B => [qw/ 1276 1899 22711 /],
C => [qw/ 12-4 12-75 12 /],
);
foreach my $example (sort keys %examples) {
print "$example: ", gen_regex(#{ $examples{$example} }) || "Generate failed!", "\n";
}
sub gen_regex {
my #cases = #_;
my %exploded;
# ex. $case may be XZ-217
foreach my $case (#cases) {
my #parts =
grep { defined and length }
split( /(\d+|\w+)/, $case );
# #parts are ( XZ, -, 217 )
foreach (#parts) {
if (/\d/) {
# 217 becomes ['\d' => 3]
push #{ $exploded{$case} }, ['\d' => length];
} elsif (/\w/) {
#XZ becomes ['\w' => 2]
push #{ $exploded{$case} }, ['\w' => length];
} else {
# - becomes ['lit' => '-']
push #{ $exploded{$case} }, ['lit' => $_ ];
}
}
}
my $pattern = '';
# iterate over nth element (part) of each case
my $ea = each_arrayref(values %exploded);
while (my #parts = $ea->()) {
# remove undefined (i.e. optional) parts
my #def_parts = grep { defined } #parts;
# check that all (defined) parts are the same type
my #part_types = uniq map {$_->[0]} #def_parts;
if (#part_types > 1) {
warn "Parts not aligned\n";
return;
}
my $type = $part_types[0]; #same so make scalar
# were there optional parts?
my $required = (#parts == #def_parts);
# keep the values of each part
# these are either a repitition or lit strings
my #values = sort uniq map { $_->[1] } #def_parts;
# these are for non-literal quantifiers
my $min = $required ? $values[0] : 0;
my $max = $values[-1];
# write the specific pattern for each type
if ($type eq '\d') {
$pattern .= '\d' . "{$min,$max}";
} elsif ($type eq '\w') {
$pattern .= '\w' . "{$min,$max}";
} elsif ($type eq 'lit') {
# quote special characters, - becomes \-
my #uniq = map { quotemeta } uniq #values;
# join with alternations, surround by non-capture grouup, add quantifier
$pattern .= '(?:' . join('|', #uniq) . ')' . ($required ? '{1}' : '?');
}
}
# build the qr regex from pattern
my $regex = qr/$pattern/;
# test that all original patterns match (#fail should be empty)
my #fail = grep { $_ !~ $regex } #cases;
if (#fail) {
warn "Some cases fail for generated pattern $regex: (#fail)\n";
return '';
} else {
return $regex;
}
}
To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.

If there was a Tie::StringApproxHash module, it would fit the bill here.
I think you're looking for something that combines the fuzzy-logic functionality of String::Approx and the hash interface of Tie::RegexpHash.
The former is more important; the latter would make light work of coding.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How do I grab an unknown number of captures from a pattern? - regex

Related

Dynamically capture regular expression match in Perl

How can I match these function calls, and extract the nmae of the function and the first argument?

Perl Regex with eval and late expression doesn't seem to execute twice (unless EVAL'd)

Perl all matches of a regexp in a given string

How to automagically create pattern based on real data?

Categories

Resources