Arithmetic Calculation in Perl Substitute Pattern Matching - regex

Using just one Perl substitute regular expression statement (s///), how can we write below:
Every success match contains just a string of Alphabetic characters A..Z. We need to substitute the match string with a substitution that will be the sum of character index (in alphabetical order) of every character in the match string.
Note: For A, character index would be 1, for B, 2 ... and for Z would be 26.
Please see example below:
success match: ABCDMNA
substitution result: 38
Note:
1 + 2 + 3 + 4 + 13 + 14 + 1 = 38;
since
A = 1, B = 2, C = 3, D = 4, M = 13, N = 14 and A = 1.

I will post this as an answer, I guess, though the credit for coming up with the idea should go to abiessu for the idea presented in his answer.
perl -ple'1 while s/(\d*)([A-Z])/$1+ord($2)-64/e'
Since this is clearly homework and/or of academic interest, I will post the explanation in spoiler tags.
- We match an optional number (\d*), followed by a letter ([A-Z]). The number is the running sum, and the letter is what we need to add to the sum.
- By using the /e modifier, we can do the math, which is add the captured number to the ord() value of the captured letter, minus 64. The sum is returned and inserted instead of the number and the letter.
- We use a while loop to rinse and repeat until all letters have been replaced, and all that is left is a number. We use a while loop instead of the /g modifier to reset the match to the start of the string.

Just split, translate, and sum:
use strict;
use warnings;
use List::Util qw(sum);
my $string = 'ABCDMNA';
my $sum = sum map {ord($_) - ord('A') + 1} split //, $string;
print $sum, "\n";
Outputs:
38

Can you use the /e modifier in the substitution?
$s = "ABCDMNA";
$s =~ s/(.)/$S += ord($1) - ord "#"; 1 + pos $s == length $s ? $S : ""/ge;
print "$s\n"

Consider the following matching scenario:
my $text = "ABCDMNA";
my $val = $text ~= s!(\d)*([A-Z])!($1+ord($2)-ord('A')+1)!gr;
(Without having tested it...) This should repeatedly go through the string, replacing one character at a time with its ordinal value added to the current sum which has been placed at the beginning. Once there are no more characters the copy (/r) is placed in $val which should contain the translated value.

Or an short alternative:
echo ABCDMNA | perl -nlE 'm/(.)(?{$s+=-64+ord$1})(?!)/;say$s'
or readable
$s = "ABCDMNA";
$s =~ m/(.)(?{ $sum += ord($1) - ord('A')+1 })(?!)/;
print "$sum\n";
prints
38
Explanation:
trying to match any character what must not followed by "empty regex". /.(?!)/
Because, an empty regex matches everything, the "not follow by anything", isn't true ever.
therefore the regex engine move to the next character, and tries the match again
this is repeated until is exhausted the whole string.
because we want capture the character, using capture group /(.)(?!)/
the (?{...}) runs the perl code, what sums the value of the captured character stored in $1
when the regex is exhausted (and fails), the last say $s prints the value of sum
from the perlre
(?{ code })
This zero-width assertion executes any embedded Perl code. It always
succeeds, and its return value is set as $^R .
WARNING: Using this feature safely requires that you understand its
limitations. Code executed that has side effects may not perform
identically from version to version due to the effect of future
optimisations in the regex engine. For more information on this, see
Embedded Code Execution Frequency.

Related

Sliding window pattern match in perl or matlab regular expressions

I am trying to use either Perl or MATLAB to parse a few numbers out of a single line of text. My text line is:
t10_t20_t30_t40_
now in matlab, i used the following script
str = 't10_t20_t30_t40_';
a = regexp(str,'t(\d+)_t(\d+)','match')
and it returns
a =
't10_t20' 't30_t40'
What I want is for it to also return 't20_t30', since this obviously is a match. Why doesn't regexp scan it?
I thus turned to Perl, and wrote the following in Perl:
#!/usr/bin/perl -w
$str = "t10_t20_t30_t40_";
while($str =~ /(t\d+_t\d+)/g)
{
print "$1\n";
}
and the result is the same as matlab
t10_t20
t30_t40
but I really wanted "t20_t30" also be in the results.
Can anyone tell me how to accomplish that? Thanks!
[update with a solution]:
With help from colleagues, I identified a solution using the so-called "look-around assertion" afforded by Perl.
#!/usr/bin/perl -w
$str = "t10_t20_t30_t40_";
while($str =~ m/(?=(t\d+_t\d+))/g)
{print "$1\n";}
The key is to use "zero width look-ahead assertion" in Perl. When Perl (and other similar packages) uses regexp to scan a string, it does not re-scan what was already scanned in the last match. So in the above example, t20_t30 will never show up in the results. To capture that, we need to use a zero-width lookahead search to scan the string, producing matches that do not exclude any substrings from subsequent searches (see the working code above). The search will start from zero-th position and increment by one as many times as possible if "global" modifier is appended to the search (i.e. m//g), making it a "greedy" search.
This is explained in more detail in this blog post.
The expression (?=t\d+_t\d+) matches any 0-width string followed by t\d+_t\d+, and this creates the actual "sliding window". This effectively returns ALL t\d+_t\d+ patterns in $str without any exclusion since every position in $str is a 0-width string. The additional parenthesis captures the pattern while its doing sliding matching (?=(t\d+_t\d+)) and thus returns the desired sliding window outcome.
Using Perl:
#!/usr/bin/perl
use Data::Dumper;
use Modern::Perl;
my $re = qr/(?=(t\d+_t\d+))/;
my #l = 't10_t20_t30_t40' =~ /$re/g;
say Dumper(\#l);
Output:
$VAR1 = [
't10_t20',
't20_t30',
't30_t40'
];
Once the regexp algorithm has found a match, the matched characters are not considered for further matches (and usually, this is what one wants, e.g. .* is not supposed to match every conceivable contiguous substring of this post). A workaround would be to start the search again one character after the first match, and collect the results:
str = 't10_t20_t30_t40_';
sub_str = str;
reg_ex = 't(\d+)_t(\d+)';
start_idx = 0;
all_start_indeces = [];
all_end_indeces = [];
off_set = 0;
%// While there are matches later in the string and the first match of the
%// remaining string is not the last character
while ~isempty(start_idx) && (start_idx < numel(str))
%// Calculate offset to original string
off_set = off_set + start_idx;
%// extract string starting at first character after first match
sub_str = sub_str((start_idx + 1):end);
%// find further matches
[start_idx, end_idx] = regexp(sub_str, reg_ex, 'once');
%// save match if any
if ~isempty(start_idx)
all_start_indeces = [all_start_indeces, start_idx + off_set];
all_end_indeces = [all_end_indeces, end_idx + off_set];
end
end
display(all_start_indeces)
display(all_end_indeces)
matched_strings = arrayfun(#(st, en) str(st:en), all_start_indeces, all_end_indeces, 'uniformoutput', 0)

Using perl Regular expressions I want to make sure a number comes in order

I want to use a regular expression to check a string to make sure 4 and 5 are in order. I thought I could do this by doing
'$string =~ m/.45./'
I think I am going wrong somewhere. I am very new to Perl. I would honestly like to put it in an array and search through it and find out that way, but I'm assuming there is a much easier way to do it with regex.
print "input please:\n";
$input = <STDIN>;
chop($input);
if ($input =~ m/45/ and $input =~ m/5./) {
print "works";
}
else {
print "nata";
}
EDIT: Added Info
I just want 4 and 5 in order, but if 5 comes before at all say 322195458900023 is the number then where 545 is a problem 5 always have to come right after 4.
Assuming you want to match any string that contains two digits where the first digit is smaller than the second:
There is an obscure feature called "postponed regular expressions". We can include code inside a regular expression with
(??{CODE})
and the value of that code is interpolated into the regex.
The special verb (*FAIL) makes sure that the match fails (in fact only the current branch). We can combine this into following one-liner:
perl -ne'print /(\d)(\d)(??{$1<$2 ? "" : "(*FAIL)"})/ ? "yes\n" :"no\n"'
It prints yes when the current line contains two digits where the first digit is smaller than the second digit, and no when this is not the case.
The regex explained:
m{
(\d) # match a number, save it in $1
(\d) # match another number, save it in $2
(??{ # start postponed regex
$1 < $2 # if $1 is smaller than $2
? "" # then return the empty string (i.e. succeed)
: "(*FAIL)" # else return the *FAIL verb
}) # close postponed regex
}x; # /x modifier so I could use spaces and comments
However, this is a bit advanced and masochistic; using an array is (1) far easier to understand, and (2) probably better anyway. But it is still possible using only regexes.
Edit
Here is a way to make sure that no 5 is followed by a 4:
/^(?:[^5]+|5(?=[^4]|$))*$/
This reads as: The string is composed from any number (zero or more) characters that are not a five, or a five that is followed by either a character that is not a four or the five is the end of the string.
This regex is also a possibility:
/^(?:[^45]+|45)*$/
it allows any characters in the string that are not 4 or 5, or the sequence 45. I.e., there are no single 4s or 5s allowed.
You just need to match all 5 and search fails, where preceded is not 4:
if( $str =~ /(?<!4)5/ ) {
#Fail
}

Is there a way to evaluate the number of times a Perl regular expression has matched?

I've been poring over perldoc perlre as well as the Regular Expressions Cookbook and related questions on Stack Overflow and I can't seem to find what appears to be a very useful expression: how do I know the number of current match?
There are expressions for the last closed group match ($^N), contents of match 3 (\g{3} if I understood the docs correctly), $', $& and $`. But there doesn't seem to be a variable I can use that simply tells me what the number of the current match is.
Is it really missing? If so, is there any explained technical reason why it is a hard thing to implement, or am I just not reading the perldoc carefully enough?
Please note that I'm interested in a built-in variable, NOT workarounds like using (${$count++}).
For context, I'm trying to build a regular expression that would match only some instances of a match (e.g. match all occurrences of character "E" but do NOT match occurrences 3, 7 and 10 where 3, 7 and 10 are simply numbers in an array). I ran into this when trying to construct a more idiomatic answer to this SO question.
I want to avoid evaluating regexes as strings to actually insert 3, 7 and 10 into the regex itself.
I'm completely ignoring the actually utility or wisdom of using this for the other question.
I thought #- or #+ might do what you want since they hold the offsets of the numbered matches, but it looks like the regex engine already knows what the last index will be:
use v5.14;
use Data::Printer;
$_ = 'abc123abc345abc765abc987abc123';
my #matches = m/
([0-9]+)
(?{
print 'Matched \$' . $#+ . " group with $^N\n";
say p(#+);
})
.*?
([0-9]+)
(?{
print 'Matched \$' . $#+ . " group with $^N\n";
say p(#+);
})
/x;
say "Matches: #matches";
This gives strings that show the last index as 2 even though it hasn't matched $2 yet.
Matched \$2 group with 123
[
[0] 6,
[1] 6,
[2] undef
]
Matched \$2 group with 345
[
[0] 12,
[1] 6,
[2] 12
]
Matches: 123 345
Notice that the first time around, $+[2] is undef, so that one hasn't been filled in yet. You might be able to do something with that, but I think that's probably getting away from the spirit of your question. If you were really fancy, you could create a tied scalar that has the value of the last defined index in #+, I guess.
I played around with this for a bit. Again, I know that this is not really what you are looking for, but I don't think that exists in the way you want it.
I had two thoughts. First, with a split using separator retention mode, you get the interstitial bits as the odd numbered elements in the output list. With the list from the split, you count which match you are on and put it back together how you like:
use v5.14;
$_ = 'ab1cdef2gh3ij4k5lmn6op7qr8stu9vw10xyz';
my #bits = split /(\d+)/; # separator retention mode
my #skips = qw(3 7 10);
my $s;
while( my( $index, $value ) = each #bits ) {
# shift indices to match number ( index = 2 n - 1 )
if( $index % 2 and ! ( ( $index + 1 )/2 ~~ #skips ) ) {
$s .= '^';
}
else {
$s .= $value;
}
}
I get:
ab^cdef^gh3ij^k^lmn^op7qr^stu^vw10xyz
I thought I really liked my split answer until I had the second thought. Does state work inside a substitution? It appears that it does:
use v5.14;
$_ = 'ab1cdef2gh3ij4k5lmn6op7qr8stu9vw10xyz';
my #skips = qw(3 7 10);
s/(\d+)/
state $n = 0;
$n++;
$n ~~ #skips ? $1 : '$'
/eg;
say;
This gives me:
ab$cdef$gh3ij$k$lmn$op7qr$stu$vw10xyz
I don't think you can get much simpler than that, even if that magic variable existed.
I had a third thought which I didn't try. I wonder if state works inside a code assertion. It might, but then I'd have to figure out how to use one of those to make a match fail, which really means it has to skip over the bit that might have matched. That seems really complicated, which is probably what Borodin was pressuring you to show even in pseudocode.

Count the number of matches of a particular character in a string matched by a regex wildcard

Can I keep a count of each different character matched in the regex itself ?
Suppose the regex goes looks like />(.*)[^a]+/
Can I keep a count of the occurrences of, say the letter p in the string captured by the group (.*)?
You would have to capture the string matched and process it separately.
This code demonstrates
use strict;
use warnings;
my $str = '> plantagenetgoosewagonattributes';
if ($str =~ />(.*)[^a]+/) {
my $substr = $1;
my %counts;
$counts{$_}++ for $substr =~ /./g;
print "'$_' - $counts{$_}\n" for sort keys %counts;
}
output
' ' - 1
'a' - 4
'b' - 1
'e' - 4
'g' - 3
'i' - 1
'l' - 1
'n' - 3
'o' - 3
'p' - 1
'r' - 1
's' - 1
't' - 5
'u' - 1
'w' - 1
Outside of the regex :
my $p_count = map /p/g, />(.*)[^a]/;
Self-contained:
local our $p_count;
/
(?{ 0 })
>
(?: p (?{ $^R + 1 })
| [^p]
)*
[^a]
(?{ $p_count = $^R; })
/x;
In both cases, you can easily expand this to count all letters. For example,
my %counts;
if (my ($seq = />(.*)[^a]/) {
++$counts{$_} for split //, $seq;
}
my $p_count = $counts{'p'};
AFAIK, you can't. You can only capture some group by parentheses and later check the length of data captured by that group.
Going along the lines of Borodin's solution , here is a pure bash one :
let count=0
testarray=(a b c d e f g h i j k l m n o p q r s t u v w x y z)
string="> plantagenetgoosewagonattributes" # the string
pattern=">(.*)[^a]+" # regex pattern
limitvar=${#testarray[#]} #array length
[[ $string =~ $pattern ]] &&
( while [ $count -lt $limitvar ] ; do sub="${BASH_REMATCH[1]//[^${testarray[$count]}]}" ; echo "${testarray[$count]} = ${#sub}" ; ((count++)) ; done )
Staring from bash 3.0 , bash has introduced the capture groups which can be accessed through BASH_REMATCH[n].
The Solution declares the characters to be counted as arrays [ Check out declare -a for array declaraton in complex cases] .A single character count would require no count variables ,no while construct but a variable for the character instead of an array .
If you are including ranges as in the code above , this array declaration does the exact thing .
testarray=(`echo {a..z}`)
An introduction of an if
loop will account for the display of 0 count characters . I wanted to keep the solution as simple as possible .
There is the experimental, don't-use-me, (?{ code }) construct...
From man perlre:
"(?{ code })"
WARNING: This extended regular expression feature is considered experimental, and may be
changed without notice. Code executed that has side effects may not perform identically
from version to version due to the effect of future optimisations in the regex engine.
If that didn't scare you off, here's an example that counts the number of "p"s
my $p_count;
">pppppbca" =~ /(?{ $p_count = 0 })>(p(?{$p_count++})|.)*[^a]+/;
print "$p_count\n";
First a remark: Due to the greediness of *, the last [^a]+ will never match more than one non-a character -- i.e., you might as well drop the +.
And as #mvf said, you need to capture the string that the wildcard matches to be able to count the characters in it. Perl regular expressions do not have a way to return a count of how many times a specific group matches -- the engine probably keeps the number around to support the {,n} mechanism, but you can't get at it.

Regex to match the longest repeating substring

I'm writing regular expression for checking if there is a substring, that contains at least 2 repeats of some pattern next to each other. I'm matching the result of regex with former string - if equal, there is such pattern. Better said by example: 1010 contains pattern 10 and it is there 2 times in continuous series. On other hand 10210 wouldn't have such pattern, because those 10 are not adjacent.
What's more, I need to find the longest pattern possible, and it's length is at least 1. I have written the expression to check for it ^.*?(.+)(\1).*?$. To find longest pattern, I've used non-greedy version to match something before patter, then pattern is matched to group 1 and once again same thing that has been matched for group1 is matched. Then the rest of string is matched, producing equal string. But there's a problem that regex is eager to return after finding first pattern, and don't really take into account that I intend to make those substrings before and after shortest possible (leaving the rest longest possible). So from string 01011010 I get correctly that there's match, but the pattern stored in group 1 is just 01 though I'd except 101.
As I believe I can't make pattern "more greedy" or trash before and after even "more non-greedy" I can only come whit an idea to make regex less eager, but I'm not sure if this is possible.
Further examples:
56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match
What I would get using current expression (same strings used):
no pattern - no match
pattern 2 - match
pattern 555 - match
pattern 1919 - match
pattern 191919 - match
pattern 23 - match
In Perl you can do it with one expression with help of (??{ code }):
$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
Output:
101
What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.
To make the expression for the longer pair a postponed subexpression construct is used (??{ code }), which evaluates the code inside (every time) and uses the returned string as an expression.
The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).
Thus the full expression would have the form:
(?=(.+)\1)(?!.+?(..{N,})\2}))
With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).
Usage example:
use v5.10;
sub longest_rep{
$_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}
say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';
Output:
101
10001
191919
101
You can do it in a single regex, you just have to pick the longest match from the list of results manually.
def longestrepeating(strg):
regex = re.compile(r"(?=(.+)\1)")
matches = regex.findall(strg)
if matches:
return max(matches, key=len)
This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):
>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
               "191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']
Here's a long-ish script that does what you ask. It basically goes through your input string, shortens it by one, then goes through it again. Once all possible matches are found, it returns one of the longest. It is possible to tweak it so that all the longest matches are returned, instead of just one, but I'll leave that to you.
It's pretty rudimentary code, but hopefully you'll get the gist of it.
use v5.10;
use strict;
use warnings;
while (<DATA>) {
chomp;
print "$_ : ";
my $longest = foo($_);
if ($longest) {
say $longest;
} else {
say "No matches found";
}
}
sub foo {
my $num = shift;
my #hits;
for my $i (0 .. length($num)) {
my $part = substr $num, $i;
push #hits, $part =~ /(.+)(?=\1)/g;
}
my $long = shift #hits;
for (#hits) {
if (length($long) < length) {
$long = $_;
}
}
return $long;
}
__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919
Not sure if anyone's thought of this...
my $originalstring="pdxabababqababqh1234112341";
my $max=int(length($originalstring)/2);
my #result;
foreach my $n (reverse(1..$max)) {
#result=$originalstring=~m/(.{$n})\1/g;
last if #result;
}
print join(",",#result),"\n";
The longest doubled match cannot exceed half the length of the original string, so we count down from there.
If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.
my $n;
foreach (1..$max) {
unless (#result=$originalstring=~m/(.{$_,})\1/g) {
$n=--$_;
last;
}
}
#result=$originalstring=~m/(.{$n})\1/g;
print join(",",#result),"\n";
Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.
One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:
my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
if($s =~ m/(.{$i})\1/)
{
$one = $1;
last;
}
}
# now $one is '101'