Perl regex vs. Raku regex, differences in the engine?

I am trying to convert a regex-based solution for the knapsack problem from Perl to Raku. Details are on PerlMonks.
The Perl solution creates this regex:
(?<P>(?:vvvvvvvvvv)?)
(?<B>(?:vv)?)
(?<Y>(?:vvvv)?)
(?<G>(?:vv)?)
(?<R>(?:v)?)
0
(?=
(?(?{ $1 })wwww|)
(?(?{ $2 })w|)
(?(?{ $3 })wwwwwwwwwwww|)
(?(?{ $4 })ww|)
(?(?{ $5 })w|)
)
which gets matched against vvvvvvvvvvvvvvvvvvv0wwwwwwwwwwwwwww. After that the match hash %+ contains the items to put in the sack.
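As a minimal sketch of how the Perl side behaves, the regex above can be run as a standalone script. The item values and weights are the ones from the test script in this question (P: v=10/w=4, B: v=2/w=1, Y: v=4/w=12, G: v=2/w=2, R: v=1/w=1); the variable names `$str`, `$re`, and `@chosen` are only illustrative.

```perl
use strict;
use warnings;

# 19 v's (sum of all values), a separator, and 15 w's (the weight limit).
my $str = ('v' x 19) . '0' . ('w' x 15);

my $re = qr{
    (?<P>(?:vvvvvvvvvv)?)
    (?<B>(?:vv)?)
    (?<Y>(?:vvvv)?)
    (?<G>(?:vv)?)
    (?<R>(?:v)?)
    0
    (?=
        (?(?{ $1 })wwww|)
        (?(?{ $2 })w|)
        (?(?{ $3 })wwwwwwwwwwww|)
        (?(?{ $4 })ww|)
        (?(?{ $5 })w|)
    )
}x;

# An item is "in the sack" when its named group captured a non-empty string.
my @chosen;
if ($str =~ $re) {
    @chosen = grep { length $+{$_} } sort keys %+;
}
print "Items in the sack: @chosen\n";
```

The pattern is unanchored, so the engine skips leading v's until the selected items fit into the 15 available w's; each conditional in the lookahead only demands an item's weight when that item's group is non-empty.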
My Raku conversion is:
$<B> = [ [ vv ]? ]
$<P> = [ [ vvvvvvvvvv ]? ]
$<R> = [ [ v ]? ]
$<Y> = [ [ vvvv ]? ]
$<G> = [ [ vv ]? ]
0
<?before
[ { say "B"; say $/<B>; say $0; say $1; $1 } w || { "" } ]
[ { say "P"; say $/<P>; say $0; say $1; $2 } wwww || { "" } ]
[ { say "R"; say $/<R>; say $0; say $1; $3 } w || { "" } ]
[ { say "Y"; say $/<Y>; say $0; say $1; $4 } wwwwwwwwwwww || { "" } ]
[ { say "G"; say $/<G>; say $0; say $1; $5 } ww || { "" } ]
>
which also matches vvvvvvvvvvvvvvvvvvv0wwwwwwwwwwwwwww. But the match object, $/, does not contain anything useful. Also, all of my debug say calls print Nil, so at that point the backreferences do not seem to work?
Here's my test script:
my $max-weight = 15;
my %items =
    'R' => { w => 1, v => 1 },
    'B' => { w => 1, v => 2 },
    'G' => { w => 2, v => 2 },
    'Y' => { w => 12, v => 4 },
    'P' => { w => 4, v => 10 }
;
my $str = 'v' x %items.map(*.value<v>).sum ~
          '0' ~
          'w' x $max-weight;
say $str;
my $i = 0;
my $left = my $right = '';
for %items.keys -> $item-name
{
    my $v = 'v' x %items{ $item-name }<v>;
    my $w = 'w' x %items{ $item-name }<w>;
    $left ~= sprintf( '$<%s> = [ [ %s ]? ] ' ~"\n", $item-name, $v );
    $right ~= sprintf( '[ { say "%s"; say $/<%s>; say $0; say $1; $%d } %s || { "" } ]' ~ "\n", $item-name, $item-name, ++$i, $w );
}
use MONKEY-SEE-NO-EVAL;
my $re = sprintf( '%s0' ~ "\n" ~ '<?before ' ~ "\n" ~ '%s>' ~ "\n", $left, $right );
say $re;
dd $/ if $str ~~ m:g/<$re>/;

This answer only covers what's going wrong; it does not offer a solution. I have not filed corresponding bugs, and I have not yet searched the bug queues for existing reports of either of the two issues I've surfaced.
my $lex-var;
sub debug { .say for ++$, :$<rex-var>, :$lex-var }
my $regex = / $<rex-var> = (.) { $lex-var = $<rex-var> } <?before . { debug }> / ;
'xx' ~~ $regex; say $/;
'xx' ~~ / $regex /; say $/;
displays:
1
rex-var => Nil
lex-var => 「x」
「x」
rex-var => 「x」
2
rex-var => Nil
lex-var => 「x」
「x」
Focusing first on the first call of debug (the lines starting with 1 and ending at rex-var => 「x」), we can see that:
Something's gone awry during the call to debug: $<rex-var> is reported as having the value Nil.
When the regex match is complete and we return to the mainline, the say $/ reports a full and correctly populated result that includes the rex-var named match.
To begin to get a sense of what's gone wrong, please consider reading the bulk of my answer to another SO question. You can safely skip the "Using ~" section. Footnotes 1, 2, and 6 are also probably irrelevant to your scenario.
For the second match, we see that not only is $<rex-var> reported as being Nil during the debug call, the final match variable, as reported back in the mainline with the second say $/, is also missing the rex-var match. And the only difference is that the regex $regex is called from within an outer regex.

Related

Why do @+ and @{^CAPTURE} differ in length?

I'm trying to understand how the regex variables work, so I can save submatch positions in the payload within embedded code expressions. According to perlvar, the positive indices of the array correspond to $1, $2, $3, etc., but that doesn't seem to be the case?
#!/usr/bin/perl -w
use v5.28;
use Data::Dumper;
"XY" =~ / ( (.*) (.) (?{
    say Dumper { match_end => \@+ };
    say Dumper { capture => \@{^CAPTURE} }
}) ) (.)/x;
Output:
$VAR1 = {
'match_end' => [
2,
undef,
1,
2,
undef
]
};
$VAR1 = {
'capture' => [
undef,
'X',
'Y'
]
};
$VAR1 = {
'match_end' => [
1,
2,
0,
1,
undef
]
};
$VAR1 = {
'capture' => [
'XY',
'',
'X'
]
};
The @+ array apparently gets allocated, or otherwise prepared, already at compilation:
perl -MData::Dump=dd -we'$_=q(abc); / (?{dd @+}) ( (.) )/x'
prints
(0, undef, undef)
(0 for the whole match and an undef for each indicated capture group), while
perl -MData::Dump=dd -we'$_=q(abc); / (?{dd @+}) ( (.) (.) )/x'
prints
(0, undef, undef, undef)
with one more element for one more capture group.
On the other hand, @{^CAPTURE} is just plain empty until there are actual patterns to capture, as we can see from mob's detailed analysis. This, I'd say, plays well with its name.
After the fact the arrays agree, with a shift of one in the indices, since @+ also contains (the offset for) the whole match, at $+[0].
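The index shift can be checked after a successful match with a short sketch like the following. It assumes Perl 5.26 or later (where @{^CAPTURE} is available); `$text` and `@report` are illustrative names, and the pattern is a simplified one in which every group participates in the match.

```perl
use strict;
use warnings;
use 5.026;   # @{^CAPTURE} requires Perl 5.26+

my $text = 'abc';
my @report;
if ($text =~ /((.)(.))/) {
    # @+ has one element more than @{^CAPTURE}: slot 0 describes the whole match.
    push @report, sprintf 'offsets=%d captures=%d', scalar @+, scalar @{^CAPTURE};
    for my $n (1 .. $#+) {
        # Rebuild group $n from its start/end offsets and compare it with
        # the entry at index $n - 1 of @{^CAPTURE}.
        my $from_offsets = substr $text, $-[$n], $+[$n] - $-[$n];
        push @report, sprintf '$%d: %s / %s', $n, $from_offsets, ${^CAPTURE}[$n - 1];
    }
}
say for @report;
```

Each report line shows the substring reconstructed from $-[n] and $+[n] next to ${^CAPTURE}[n-1]; after a completed match the two agree.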
Another difference is that a trailing failed optional match doesn't get a slot in @{^CAPTURE}:
perl -MData::Dump=dd -we'$_=q(abc); /((x)? (.) (x)?)/x; dd @+; dd @{^CAPTURE}'
prints
(1, 1, undef, 1, undef)
("a", undef, "a")
The perlvar docs are unclear about what @{^CAPTURE} looks like in the middle of a regexp evaluation, but there is a clear progression that depends on where in the regexp you are looking at it.
use 5.026;
use Data::Dumper; $Data::Dumper::Sortkeys = 1; $Data::Dumper::Indent = 0;
sub DEBUG_CAPTURE { say Dumper { a => $_[0], capture => \@{^CAPTURE} }; }
"XY" =~ /
(?{DEBUG_CAPTURE(0)})
(
(?{DEBUG_CAPTURE(1)})
(
(?{DEBUG_CAPTURE(2)})
(.*) (?{DEBUG_CAPTURE(3)})
(.) (?{DEBUG_CAPTURE(4)})
)
(?{DEBUG_CAPTURE(5)}) (.)
(?{DEBUG_CAPTURE(6)})
)
(?{DEBUG_CAPTURE(7)}) /x;
DEBUG_CAPTURE(8);
Output
$VAR1 = {'a' => 0,'capture' => []};
$VAR1 = {'a' => 1,'capture' => []};
$VAR1 = {'a' => 2,'capture' => []};
$VAR1 = {'a' => 3,'capture' => [undef,undef,'XY']};
$VAR1 = {'a' => 3,'capture' => [undef,undef,'X']};
$VAR1 = {'a' => 4,'capture' => [undef,undef,'X','Y']};
$VAR1 = {'a' => 5,'capture' => [undef,'XY','X','Y']};
$VAR1 = {'a' => 3,'capture' => [undef,'XY','','Y']};
$VAR1 = {'a' => 4,'capture' => [undef,'XY','','X']};
$VAR1 = {'a' => 5,'capture' => [undef,'X','','X']};
$VAR1 = {'a' => 6,'capture' => [undef,'X','','X','Y']};
$VAR1 = {'a' => 7,'capture' => ['XY','X','','X','Y']};
$VAR1 = {'a' => 8,'capture' => ['XY','X','','X','Y']};
The docs are correct if you are observing @{^CAPTURE} after a regexp has been completely evaluated. While evaluation is in progress, @{^CAPTURE} seems to grow as the number of capture groups encountered increases. But it's not clear how useful it is to look at @{^CAPTURE} before you get to the end of the expression.

Select characters that appear only once in a string

Is it possible to select characters that appear only once?
I am familiar with negative look-behind, and tried the following
/(.)(?<!\1.*)/
but could not get it to work.
examples:
given AXXDBD it should output ADBD
       ^^ - this is unacceptable
given 123558 it should output 1238
         ^^ - this is unacceptable
thanks in advance for the help
There are probably a lot of approaches to this, but I think you're looking for something like
(.)\1{1,}
That is, any character followed by the same character at least once.
Your question is tagged with both PHP and JS, so:
PHP:
$str = preg_replace('/(.)\1{1,}/', '', $str);
JS:
str = str.replace(/(.)\1{1,}/g, '');
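For comparison, here is a minimal Perl sketch of the same substitution, run against the two inputs from the question. The helper name `drop_repeated_runs` is hypothetical.

```perl
use strict;
use warnings;

# Remove every run of two or more identical adjacent characters.
sub drop_repeated_runs {
    my ($str) = @_;
    (my $out = $str) =~ s/(.)\1+//gs;   # (.) captures a character, \1+ its repeats
    return $out;
}

print drop_repeated_runs($_), "\n" for qw(AXXDBD 123558);
```

The substitution is a single pass: characters that only become adjacent after a run is removed (for example the two A's left over from ABBA) are not scanned again.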
Without using a regular expression:
function not_twice ($str) {
    $str = (string)$str;
    $new_str = '';
    $prev = false;
    for ($i=0; $i < strlen($str); $i++) {
        if ($str[$i] !== $prev) {
            $new_str .= $str[$i];
        }
        $prev = $str[$i];
    }
    return $new_str;
}
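Collapses each run of consecutive identical characters (1+) into a single character, and casts numbers to strings in case you need that too. Note that this keeps one copy of the repeated character (AXXDBD becomes AXDBD) rather than removing the run entirely.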
Testing:
$string = [
'AXXDBD',
'123558',
12333
];
$string = array_map('not_twice', $string);
echo '<pre>' . print_r($string, true) . '</pre>';
Outputs:
Array
(
[0] => AXDBD
[1] => 12358
[2] => 123
)

Perl 5 - longest token matching in regexp (using alternation)

Is it possible to force a Perl 5 regexp to match the longest possible string when the regexp is, for example:
a|aa|aaa
I found that this is probably the default in Perl 6, but how can I get this behavior in Perl 5?
EXAMPLE pattern:
[0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9]
If I have the string 2.10.2014, the first match will be 2, which is OK; but the next match will be 1, which is not OK because it should be 10. Then 2014 will be matched as four separate matches (2, 0, 1, 4), but it should be matched as 2014 by [0-9][0-9][0-9][0-9]. I know I could use [0-9]+, but I can't.
General solution: Put the longest one first.
my ($longest) = /(aaa|aa|a)/
Specific solution: Use
my ($longest) = /([0-9]{4}|[0-9]{1,2})/
If you can't edit the pattern, you'll have to find every possibility and find the longest of them.
my $longest = '';   # start with an empty string so length() never sees undef
while (/([0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9])/g) {
    $longest = $1 if length($1) > length($longest);
}
The sanest solution I can see for unknown patterns is to match every possible pattern, look at the length of the matched substrings and select the longest substring:
my @patterns = (qr/a/, qr/a(a)/, qr/b/, qr/aaa/);
my $string = "aaa";
my @substrings = map {$string =~ /($_)/; $1 // ()} @patterns;
say "Matched these substrings:";
say for @substrings;
my $longest_token = (sort { length $b <=> length $a } @substrings)[0];
say "Longest token was: $longest_token";
Output:
Matched these substrings:
a
aa
aaa
Longest token was: aaa
For known patterns, one would sort them manually so that first-match is the same as longest-match:
"aaa" =~ /(aaa|aa|b|a)/;
say "I know that this was the longest substring: $1";
The alternation will use the first alternative that matches, so just write /aaa|aa|a/ instead.
For the example you have shown in your question, just put the longest alternative first like I said:
[0-9][0-9][0-9][0-9]|[0-9][0-9]|[0-9]
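A small sketch of that reordering applied to the 2.10.2014 example from the question. Sorting by the length of the pattern text is only a heuristic; it works here because each alternative is a fixed sequence of [0-9] classes, so longer text means a longer match. The variable names are illustrative.

```perl
use strict;
use warnings;

# The alternatives as given in the question, shortest first.
my @alternatives = ('[0-9]', '[0-9][0-9]', '[0-9][0-9][0-9][0-9]');

# Put the longest alternative first so that first-match equals longest-match.
my $pattern = join '|', sort { length $b <=> length $a } @alternatives;

my @tokens = '2.10.2014' =~ /($pattern)/g;
print join(', ', @tokens), "\n";
```

With the alternatives in this order the tokens come out as 2, 10, and 2014 instead of single digits.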
perl -Mstrict -Mre=/xp -MData::Dumper -wE'
{package Data::Dumper;our($Indent,$Sortkeys,$Terse,$Useqq)=(1)x4}
sub _dump { Dumper(shift) =~ s{(\[.*?\])}{$1=~s/\s+/ /gr}srge }
my ($count, %RS);
my $s= "aaaabbaaaaabbab";
$s =~ m{ \G a+b? (?{ $RS{ $+[0] - $-[0] } //= [ ${^MATCH}, $-[0] ]; $count++ }) (*FAIL) };
say sprintf "RS: %s", _dump(\%RS);
say sprintf "count: %s", $count;
'
RS: {
"1" => [ "a", 0 ],
"2" => [ "aa", 0 ],
"3" => [ "aaa", 0 ],
"4" => [ "aaaa", 0 ],
"5" => [ "aaaab", 0 ]
}
count: 5

How to automagically create pattern based on real data?

I have many vendors in a database, and their data differs in some aspects. I'd like to create a data validation rule that is based on previous data.
Example:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
Goal: if a user inputs the string 'XZ-217' for vendor B, the algorithm should compare it with the previous data and say: this string is not similar to vendor B's previous data.
Is there a good way or tool to achieve such a comparison? The answer could be a generic algorithm or a Perl module.
Edit:
"Similarity" is hard to define, I agree. But I'd like to find an algorithm that could analyze roughly the previous 100 samples and then compare the outcome of that analysis with new data. Similarity may be based on length, on the use of characters/numbers, on string creation patterns, on a similar beginning/end/middle, or on the presence of separators.
I feel it is not an easy task, but on the other hand I think it has very wide use, so I hoped there would already be some hints.
You may want to peruse:
http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm (for instance)
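To make the string-metric idea concrete, here is a minimal pure-Perl sketch of Levenshtein distance applied to the vendor samples from the question. It does not use Text::Levenshtein (so that it runs without CPAN modules); the function names `levenshtein` and `nearest_distance` are hypothetical, and what counts as "too far" would be an application-specific threshold that the question does not define.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Classic dynamic-programming edit distance, keeping only one previous row.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,           # deletion
                           $cur[$j - 1] + 1,        # insertion
                           $prev[$j - 1] + $cost);  # substitution
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Distance from the input to the closest previously seen sample.
sub nearest_distance {
    my ($input, @samples) = @_;
    return min( map { levenshtein($input, $_) } @samples );
}

my %vendors = (
    A => [qw/ XZ-4 XZ-23 XZ-217 /],
    B => [qw/ 1276 1899 22711 /],
    C => [qw/ 12-4 12-75 12 /],
);
for my $vendor (sort keys %vendors) {
    printf "%s: nearest distance to XZ-217 is %d\n",
        $vendor, nearest_distance('XZ-217', @{ $vendors{$vendor} });
}
```

A plain edit distance only compares characters, so it may need to be combined with the structural checks discussed in the other answers (length, separators, digit/letter layout).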
Joel and I came up with similar ideas. The code below differentiates 3 types of zones.
one or more non-word characters
alphanumeric cluster
a cluster of digits
It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.
use strict;
use warnings;
use List::Util qw<max min>;
sub compile_search_expr {
    shift;
    # Accept either a list of profiles or a single reference to such a list.
    @_ = @{ shift() } if @_ == 1;
    my $str
        = join( '|'
              , map { join( ''
                          , grep { defined }
                            map {
                                  # each chunk is [ type, ... ]; dispatch on the type
                                  $_->[0] eq 'P' ? quotemeta( $_->[1] )
                                : $_->[0] eq 'W' ? "\\w{$_->[1],$_->[2]}"
                                : $_->[0] eq 'D' ? "\\d{$_->[1],$_->[2]}"
                                :                  undef
                                ;
                            } @$_
                          )
                    } @_
              );
    return qr/^(?:$str)$/;
}
sub merge_profiles {
    shift;
    my ( $profile_list, $new_profile ) = @_;
    my $found = 0;
    PROFILE:
    for my $profile ( @$profile_list ) {
        my $profile_length = @$profile;
        # it's not the same profile.
        next PROFILE unless $profile_length == @$new_profile;
        my @merged;
        for ( my $i = 0; $i < $profile_length; $i++ ) {
            my $old = $profile->[$i];
            my $new = $new_profile->[$i];
            next PROFILE unless $old->[0] eq $new->[0];
            push( @merged
                , [ $old->[0]
                  , min( $old->[1], $new->[1] )
                  , max( $old->[2], $new->[2] )
                  ]);
        }
        @$profile = @merged;
        $found = 1;
        last PROFILE;
    }
    push @$profile_list, $new_profile unless $found;
    return;
}
sub compute_info_profile {
    shift;
    my @profile_chunks
        = map {
              /\W/ ? [ P => $_ ]
            : /\D/ ? [ W => length, length ]
            :        [ D => length, length ]
          }
          grep { length; } split /(\W+)/, shift
        ;
    # return a reference, which is what merge_profiles expects
    return \@profile_chunks;
}
# Pseudo-Perl
sub process_input_task {
    my ( $application, $input ) = @_;
    my $patterns = $application->get_patterns_for_current_customer;
    my $regex = $application->compile_search_expr( $patterns );
    if ( $input =~ /$regex/ ) {}
    elsif ( $application->approve_divergeance( $input )) {
        # called as a method because compute_info_profile shifts off its invocant
        $application->merge_profiles( $patterns, $application->compute_info_profile( $input ));
    }
    else {
        $application->escalate(
            Incident->new( issue    => INVALID_FORMAT
                         , input    => $input
                         , customer => $customer
                         ));
    }
    return $application->process_approved_input( $input );
}
Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.
output:
A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})
code:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw'uniq each_arrayref';
my %examples = (
A => [qw/ XZ-4 XZ-23 XZ-217 /],
B => [qw/ 1276 1899 22711 /],
C => [qw/ 12-4 12-75 12 /],
);
foreach my $example (sort keys %examples) {
    print "$example: ", gen_regex(@{ $examples{$example} }) || "Generate failed!", "\n";
}
sub gen_regex {
    my @cases = @_;
    my %exploded;
    # ex. $case may be XZ-217
    foreach my $case (@cases) {
        my @parts =
            grep { defined and length }
            split( /(\d+|\w+)/, $case );
        # @parts are ( XZ, -, 217 )
        foreach (@parts) {
            if (/\d/) {
                # 217 becomes ['\d' => 3]
                push @{ $exploded{$case} }, ['\d' => length];
            } elsif (/\w/) {
                # XZ becomes ['\w' => 2]
                push @{ $exploded{$case} }, ['\w' => length];
            } else {
                # - becomes ['lit' => '-']
                push @{ $exploded{$case} }, ['lit' => $_ ];
            }
        }
    }
    my $pattern = '';
    # iterate over nth element (part) of each case
    my $ea = each_arrayref(values %exploded);
    while (my @parts = $ea->()) {
        # remove undefined (i.e. optional) parts
        my @def_parts = grep { defined } @parts;
        # check that all (defined) parts are the same type
        my @part_types = uniq map {$_->[0]} @def_parts;
        if (@part_types > 1) {
            warn "Parts not aligned\n";
            return;
        }
        my $type = $part_types[0]; # same so make scalar
        # were there optional parts?
        my $required = (@parts == @def_parts);
        # keep the values of each part
        # these are either a repetition count or literal strings
        # (parenthesized: a bare "sort uniq LIST" would treat uniq as the comparison sub)
        my @values = uniq( map { $_->[1] } @def_parts );
        # repetition counts are numbers, so sort them numerically
        @values = sort { $a <=> $b } @values if $type ne 'lit';
        # these are for non-literal quantifiers
        my $min = $required ? $values[0] : 0;
        my $max = $values[-1];
        # write the specific pattern for each type
        if ($type eq '\d') {
            $pattern .= '\d' . "{$min,$max}";
        } elsif ($type eq '\w') {
            $pattern .= '\w' . "{$min,$max}";
        } elsif ($type eq 'lit') {
            # quote special characters, - becomes \-
            my @uniq = map { quotemeta } uniq @values;
            # join with alternations, surround by non-capture group, add quantifier
            $pattern .= '(?:' . join('|', @uniq) . ')' . ($required ? '{1}' : '?');
        }
    }
    # build the qr regex from pattern
    my $regex = qr/$pattern/;
    # test that all original patterns match (@fail should be empty)
    my @fail = grep { $_ !~ $regex } @cases;
    if (@fail) {
        warn "Some cases fail for generated pattern $regex: (@fail)\n";
        return '';
    } else {
        return $regex;
    }
}
To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.
If there was a Tie::StringApproxHash module, it would fit the bill here.
I think you're looking for something that combines the fuzzy-logic functionality of String::Approx and the hash interface of Tie::RegexpHash.
The former is more important; the latter would make light work of coding.

String matching and extraction

I have a string like "ab.cde.fg.hi", and I want to split it into two strings.
"ab.cde.fg"
".hi"
How to do so? I got some code written that will get me the 2nd string but how do I retrieve the remaining?
$mystring = "ab.cde.fg";
$mystring =~ m/.*(\..+)/;
print "$1\n";
my ($first, $second) = $string =~ /(.*)(\..*)/;
You can also use split:
my ($first, $second) = split /(?=\.[^.]+$)/, $string;
Are you sure you aren’t looking for...
($name,$path,$suffix) = File::Basename::fileparse($fullname,@suffixlist);
my @parts = /(.*)\.(.*)/s;
my @parts = split /\.(?!.*\.)/s;
my @parts = split /\.(?=[^.]*\z)/s;
Update: I misread. The "." should be included in the second part, but it's not in the above. The above should be:
my @parts = /(.*)(\..*)/s;
my @parts = split /(?=\.(?!.*\.))/s;
my @parts = split /(?=\.[^.]*\z)/s;
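As a quick check, the three corrected expressions can be run side by side on the string from the question. This is only a sketch; the hash keys (`capture`, `split_neg`, `split_anchor`) are illustrative labels, and the expressions are applied to an explicit `$string` rather than `$_`.

```perl
use strict;
use warnings;

my $string = 'ab.cde.fg.hi';

my %result = (
    # greedy (.*) backtracks to the last dot, which stays in the second capture
    capture      => [ $string =~ /(.*)(\..*)/s ],
    # split just before a dot that has no later dot after it
    split_neg    => [ split /(?=\.(?!.*\.))/s, $string ],
    # split just before a dot followed only by non-dots up to the end
    split_anchor => [ split /(?=\.[^.]*\z)/s, $string ],
);

for my $name (sort keys %result) {
    printf "%-12s %s\n", $name, join '|', @{ $result{$name} };
}
```

All three variants produce the two parts ab.cde.fg and .hi for this input.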
To promote my idea to use rindex to get
1) "ab.cde.fg"
2) ".hi"
from "ab.cde.fg.hi", I wrote this script to make experiments easier:
use strict;
use diagnostics;
use warnings;
use English;
my @tests = (
    [ 'ab.cde.fg.hi', 'ab.cde.fg|.hi' ]
  , [ 'abxcdexfg.hi', 'abxcdexfg|.hi' ]
);
for my $test (@tests) {
    my $src = $test->[0];
    my $exp = $test->[1];
    printf "-----|%s| ==> |%s|-----\n", $src, $exp;
    for my $op (
        [ 'ikegami 1' , sub { shift =~ /(.*)\.(.*)/s; } ]
      , [ 'ikegami 2' , sub { split( /\.(?!.*\.\z)/s, shift) } ]
      , [ 'rindex'    , sub { my $p = rindex( $_[0], '.' );
                              ( substr($_[0], 0, $p)
                              , substr($_[0], $p)
                              ); }
        ]
    ) {
        my ($head, $tail) = $op->[1]( $src );
        my $res = join '|', ($head, $tail);
        my $ok = $exp eq $res ? 'ok' : "fail: $exp expected.";
        printf "%12s: %-20s => %-20s : %s\n", $op->[0], $src, $res, $ok;
    }
}
output:
-----|ab.cde.fg.hi| ==> |ab.cde.fg|.hi|-----
ikegami 1: ab.cde.fg.hi => ab.cde.fg|hi : fail: ab.cde.fg|.hi expected.
ikegami 2: ab.cde.fg.hi => ab|cde : fail: ab.cde.fg|.hi expected.
rindex: ab.cde.fg.hi => ab.cde.fg|.hi : ok
-----|abxcdexfg.hi| ==> |abxcdexfg|.hi|-----
ikegami 1: abxcdexfg.hi => abxcdexfg|hi : fail: abxcdexfg|.hi expected.
ikegami 2: abxcdexfg.hi => abxcdexfg|hi : fail: abxcdexfg|.hi expected.
rindex: abxcdexfg.hi => abxcdexfg|.hi : ok