Same regex doesn't match twice

Trying to solve a problem in my perl script I finally could break it down to this situation:
my $content = 'test';
if($content =~ m/test/g) {
print "1\n";
}
if($content =~ m/test/g) {
print "2\n";
}
if($content =~ m/test/g) {
print "3\n";
}
Output:
1
3
My real case is a bit different, but in the end it's the same thing: I'm confused why regex 2 isn't matching. Does anyone have an explanation for this? I realized that /g seems to be the reason, and of course it is not needed in my example. But is this output normal behaviour, and if so, why?

This is exactly what /g in scalar context is supposed to do.
The first time it matches "test". The second match tries to start matching in the string after where the previous match left off, and fails. The third match then tries again from the beginning of the string (and succeeds) because the second match failed and you didn't also specify /c.
(/c keeps it from restarting at the beginning if a match fails; if your second match was /test/gc, the second and third match would both fail.)
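A minimal sketch of the behaviour described above, using pos() to show where the next /g match on $content would start. The variable names other than $content are illustrative only.

```perl
use strict;
use warnings;

my $content = 'test';
my @log;

# Run the same scalar-context /g match three times and record pos() each time.
for my $attempt (1 .. 3) {
    my $matched = $content =~ /test/g ? 1 : 0;
    my $pos     = pos($content);   # undef after a failed /g match without /c
    push @log, sprintf('attempt %d: matched=%d pos=%s',
                       $attempt, $matched, $pos // 'undef');
}
print "$_\n" for @log;

# With /c, a failed match leaves pos() where it was, so later matches keep failing.
my $kept   = 'test';
my $first  = $kept =~ /test/g  ? 1 : 0;   # succeeds, pos becomes 4
my $second = $kept =~ /test/gc ? 1 : 0;   # fails, pos stays 4
my $pos_c  = pos($kept);
my $third  = $kept =~ /test/gc ? 1 : 0;   # fails again from offset 4
print "with /c: $first $second $third, pos=$pos_c\n";
```

The second attempt fails because it starts at offset 4, and that failure is what resets the position for the third attempt.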

Generally speaking, if (/.../g) makes no sense and should be replaced with if (/.../) (the exception is noted at the end of this answer).
You wouldn't expect the following to match twice:
my $content = "test";
while ($content =~ /test/g) {
print(++$i, "\n");
}
So why would you expect the following to match twice:
my $content = "test";
if ($content =~ /test/g) {
print(++$i, "\n");
}
if ($content =~ /test/g) {
print(++$i, "\n");
}
They're the same!
Let's imagine $content contains testtest.
The 1st time $content =~ /test/g is evaluated in scalar context, it matches the first test.
The 2nd time $content =~ /test/g is evaluated in scalar context, it matches the second test.
The 3rd time $content =~ /test/g is evaluated in scalar context, it returns false to indicate there are no more matches. This also resets the position at which future matches against $content will start.
The 4th time $content =~ /test/g is evaluated in scalar context, it matches the first test.
...
There are advanced uses for if (/\G.../gc), but that's different. if (/.../g) only makes sense if you're unrolling a while loop. (e.g. while (1) { ...; last if !/.../g; ... }).
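As an illustration of the if (/\G.../gc) idiom mentioned above, here is a small tokenizer sketch. The input string and the token labels (ID, NUM, OP) are made up for the example; the point is only that /c keeps pos() in place when one alternative fails, so the next if can try a different pattern at the same offset.

```perl
use strict;
use warnings;

my $input = 'x = 42';
my @tokens;

pos($input) = 0;
while (pos($input) < length $input) {
    # Each branch is anchored with \G at the current position.
    # /c preserves pos() on failure so the next branch starts at the same spot.
    if    ($input =~ /\G\s+/gc)   { next }                       # skip whitespace
    elsif ($input =~ /\G(\d+)/gc) { push @tokens, "NUM:$1" }
    elsif ($input =~ /\G(\w+)/gc) { push @tokens, "ID:$1" }
    elsif ($input =~ /\G(.)/gcs)  { push @tokens, "OP:$1" }      # any other single character
}

print join(' ', @tokens), "\n";
```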

Related

Reversing a string in perl without using "reverse" function

I was looking for clues on how to reverse a string in Perl without using the builtin reverse function and came across the following piece of code for reversing $str.
print +($str =~ /./g)[-$_] for (1 .. $#{[$str =~ /./g]} + 1);
I was trying to understand how this works a bit more and expanded the above code to something like this.
for (1 .. $#{[$str =~ /./g]} + 1) {
$rev_str_1 = ($str =~ /./g)[-$_];
print $rev_str_1;
}
The above code snippet also works fine. But, the problem comes when I add any print inside the for loop to understand how the string manipulation is working.
for (1 .. $#{[$str =~ /./g]} + 1) {
$rev_str_1 = ($str =~ /./g)[-$_];
print "\nin loop now ";
print $rev_str_1;
}
For input string of stressed, following is the output for above code
in loop now d
in loop now e
in loop now s
in loop now s
in loop now e
in loop now r
in loop now t
in loop now s
It seems like the entire string reversal is happening in this part ($str =~ /./g)[-$_] but I am trying to understand why is it not working when I add an extra print. Appreciate any pointers.

You're assuming that the string is reversed before being printed, but the program just prints all the characters in the string one at a time in reverse order.
Here's how it works.
It's based around the expression $str =~ /./g, which uses a global regex match with a pattern that matches any single character. In list context it returns all the characters in the string as a list. Note that a dot . without the /s pattern modifier doesn't match linefeed. That's a bug, but probably isn't critical in this situation.
This expression
$#{ [ $str =~ /./g ] } + 1
creates an anonymous array of the characters in $str with [ $str =~ /./g ]. It then uses $# to get the index of the last element of the array, and adds 1 to get the total number of characters (because the index is zero-based). So the loop executes with $_ in the range 1 to the number of characters in $str. This is unnecessarily obscure and should probably be written 1 .. length($str), except for the special case of linefeed characters mentioned above.
The body of the loop uses ($str =~ /./g)[-$_], which splits $str into characters again in the same way as before, and then uses the fact that negative indexes in Perl refer to elements relative to the end of the array or list. So the last character in $str is at index -1, the second to last at index -2, and so on. Again, this is unnecessarily arcane; the expression is exactly equivalent to substr($str, -$_, 1), again with the exception that the regex version ignores linefeed characters.
Printing the characters one at a time like this results in $str being printed in reverse.
It may be easier to understand if the string is split into a real array, and the reversed string is accumulated into a buffer, like this:
my $reverse = '';
my @str = $str =~ /./sg;
for ( 1 .. @str ) {
$reverse .= $str[-$_];
}
print $reverse, "\n";
Or, using length and substr as described above, this is equivalent to
my $reverse = '';
$reverse .= substr($str, -$_, 1) for 1 .. length($str);
print $reverse, "\n";
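The linefeed caveat can be checked directly. This is a small sketch with a made-up input string containing a newline; it compares the character lists produced with and without /s and reverses the string with substr.

```perl
use strict;
use warnings;

my $str = "ab\ncd";

my @without_s = $str =~ /./g;    # . does not match "\n", so it is dropped
my @with_s    = $str =~ /./sg;   # /s lets . match "\n" too

# substr-based reversal keeps every character, including the newline.
my $reversed = join '', map { substr($str, -$_, 1) } 1 .. length($str);

printf "%d characters without /s, %d with /s\n",
    scalar(@without_s), scalar(@with_s);
```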

Performing two (clashing) interpolations on one string

I have an interpolate function that replaces %foo with the value from $HV{'default'}{'foo'} and %foo.bar from $HV{foo}{bar}:
sub interpolate {
my $work = "@_";
$work =~ s/\%(\w+)\.(\w+)/$HV{$1}{$2}/g;
$work =~ s/\%(\w+)/$HV{'default'}{$1}/g;
return $work;
}
However if $HV{'foo'}{'bar'} contains a % character, the second operation matches it which is not what I want. My first fix was to change all occurrences of %foo into %default.foo with
$work =~ s/\%(\w+)/%default\.$1/g;
$work =~ s/\%(\w+)\.(\w+)/$HV{$1}{$2}/g;
But this changes %foo.bar into %default.foo.bar. Is there a way to do what I want without re-doing my hash?
Also for bonus credit I'd be interested in a regular expression that would match %A.very.long.and.deeply.nested.hash.value with the corresponding value to make it work with any hash.

The easiest solution is to do a single traversal of the string, not two in a row:
$work =~ s{%(\w+)(?:\.(\w+))?}{
defined $2
? $HV{$1}{$2}
: $HV{default}{$1}
}eg;
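To fix your other approach, you could change your regex to
$work =~ s/%(\w+)\b(?!\.\w)/%default.$1/g;
to only replace %foo if it's not followed by .bar. The \b matters: without it, \w+ can backtrack to a shorter name (%fo in %foo.bar) that is no longer followed by a dot, so the lookahead succeeds and %foo.bar still becomes %default.foo.bar.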
Bonus credit: Assuming you want to replace %foo.bar.baz by $HV{foo}{bar}{baz}, this can be done as follows:
sub lookup {
my ($cur, @keys) = @_;
$cur = $cur->{$_} for @keys;
return $cur;
}
s{%(\w+(?:\.\w+)*)}{
lookup(\%HV, split(/\./, $1))
}eg;
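Putting the pieces together, a self-contained sketch might look like the following. The contents of %HV are invented for the example, and two choices here are assumptions rather than part of the answer above: a single-key placeholder such as %name is looked up under default, and a missing key yields an empty string.

```perl
use strict;
use warnings;

# Hypothetical data; the real %HV comes from the question's program.
my %HV = (
    default => { name => 'world', rate => '5%name' },
    site    => { owner => { email => 'admin@example.org' } },
);

# Walk the nested hash one key at a time; return '' if the path does not exist.
sub lookup {
    my ($cur, @keys) = @_;
    for my $key (@keys) {
        return '' unless ref($cur) eq 'HASH' && exists $cur->{$key};
        $cur = $cur->{$key};
    }
    return $cur;
}

sub interpolate {
    my ($work) = @_;
    # Single pass: replacement text is never rescanned for further placeholders.
    $work =~ s{%(\w+(?:\.\w+)*)}{
        my @keys = split /\./, $1;
        unshift @keys, 'default' if @keys == 1;   # %foo means $HV{default}{foo}
        lookup(\%HV, @keys);
    }eg;
    return $work;
}

my $out = interpolate('Hello %name, contact %site.owner.email');
print "$out\n";
```

Because the substitution makes one pass over the string, a value that itself contains a % (like the rate entry) is inserted as-is and not expanded again.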

regular expression for matching a string

I'm trying to remove part of a given string using either of two rules:
Eliminate all the consonant(s) at the beginning of a string
Eliminate all but the consonants at the beginning of a string.
Suppose my string is str. Is ${str%%[aeoui]{1}*} correct for the second rule? I'm not sure what to do for the first rule.

I'm not sure what language you are trying to implement this in, so I'll just use some generic syntax.
1. s/^[^aeiouAEIOU]*(.*)/\1/
2. s/^[aeiouAEIOU]*(.*)/\1/
There are ways to make it case insensitive, but I like being specific like this just for clarity.
The only difference between the two is ^ inside the [] in #1 which just negates it.
* means zero or more. If you use +, for instance, there would have to be at least one consonant in #1 and at least one vowel in #2 or the test would fail.
In my generic syntax here \1 returns what was found by (.*).
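Here's some very crude Perl to demonstrate (where $1 in the print statements behaves as \1 in my example above). Note that the test labels are swapped relative to the list: "Test 1" applies pattern 2 (leading vowels) and "Test 2" applies pattern 1 (leading consonants):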
#!/usr/bin/perl
$string1="abcdef";
$string2="fedcba";
if ($string1 =~ /^[aeiouAEIOU]*(.*)/) {
print "Test 1 on $string1: $1\n";
}
if ($string2 =~ /^[aeiouAEIOU]*(.*)/) {
print "Test 1 on $string2: $1\n";
}
if ($string1 =~ /^[^aeiouAEIOU]*(.*)/) {
print "Test 2 on $string1: $1\n";
}
if ($string2 =~ /^[^aeiouAEIOU]*(.*)/) {
print "Test 2 on $string2: $1\n";
}
And here's the output:
Test 1 on abcdef: bcdef
Test 1 on fedcba: fedcba
Test 2 on abcdef: abcdef
Test 2 on fedcba: edcba
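The same two substitutions can be written as small Perl helpers. This is only a sketch: the function names are made up, /i is used instead of listing both cases, and [^aeiou] matches any non-vowel character (digits and punctuation included), not just consonant letters.

```perl
use strict;
use warnings;

# Pattern 1: drop everything up to the first vowel.
sub strip_leading_consonants {
    my ($s) = @_;
    $s =~ s/^[^aeiou]+//i;
    return $s;
}

# Pattern 2: drop the leading run of vowels.
sub strip_leading_vowels {
    my ($s) = @_;
    $s =~ s/^[aeiou]+//i;
    return $s;
}

print strip_leading_consonants('fedcba'), "\n";
print strip_leading_vowels('abcdef'), "\n";
```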

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will greedily match 'a' at least three times. It may match five times, ten times, etc. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?

Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}

Use @LAST_MATCH_END and @LAST_MATCH_START (@+ and @-)
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.
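If the repeated unit is longer than one character but has a fixed width, the same offsets can still be used by dividing the matched span by the unit's length. This is a sketch under that fixed-width assumption; $unit and the sample string are invented for the example.

```perl
use strict;
use warnings;

my $str  = 'xxabababab--';
my $unit = 'ab';            # fixed-width literal being repeated
my ($span, $count);

if ($str =~ /(?:\Q$unit\E){3,}/) {
    $span  = $+[0] - $-[0];          # length of the whole match
    $count = $span / length($unit);  # number of repetitions
    print "$count repetitions\n";
}
```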

Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.

If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on Perl v5.14 and v5.16.)
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.

Break from regex loop in Perl

In Perl regex, how can I break from /ge loop..?
Let's say the code is:
s/\G(foo)(bar)(;|$)/{ break if $3 ne ';'; print "$1\n"; '' }/ge;
...break here doesn't work, but it should illustrate what I mean.

Generally, I would write this as a while statement:
while( s/(foo)(bar)/$1/ ) {
# my code to determine if I should stop
if(something) {
last;
}
}
The caveat with this method is that your search/replace will start at the beginning each time, which may matter depending on your regex.
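One way to avoid restarting from the beginning is to iterate with a scalar-context m//gc match instead of s///, use last to stop, and cut the string at the last consumed position afterwards. This is a sketch only: the stop condition (two matches) is arbitrary, and \z is used here in place of $ for the end-of-string alternative.

```perl
use strict;
use warnings;

my $str = 'foobar;foobar;foobar;';
my @kept;
my $consumed = 0;

# \G anchors each attempt where the previous match ended; /c keeps pos() on failure.
while ($str =~ /\G(foo)(bar)(;|\z)/gc) {
    push @kept, "$1$2";
    $consumed = pos($str);
    last if @kept == 2;      # illustrative stop condition
}

# Everything after the last consumed match is left untouched.
my $rest = substr($str, $consumed);
print "kept: @kept\nrest: $rest\n";
```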
If you really wanted to do it in the regex, you could write a function that returns an unmodified string if you reached your end point, such as a count in this case:
my $count=0;
sub myfunc {
my ($string, $a, $b) = @_;
$count++;
if($count > 3) {
return $string;
}
return $a;
}
$mystring = "foobar foobar, foobar + foobar and foobar";
$mystring =~ s/((foo)(bar))/myfunc($1,$2,$3)/ge;
# result: $mystring => "foo foo, foo + foobar and foobar"
If I knew your specific case, I could probably provide a more helpful example.

You can use some experimental features to emulate a break statement. The Perl documentation for some of these features warns that they may change in future versions of Perl.
my $str = "abcdef";
my $stop = 0;
$str =~ s/(?(?{ $stop })(?!))(.)/ $stop = 1 if $1 ge "c"; "X" /ge;
print "$str\n";
This will print XXXdef.
A piece wise explanation:
(?(condition)yes-pattern): if the condition holds, match yes-pattern; otherwise match nothing.
(?{ code }): execute code. When used as the condition of a conditional, the yes-pattern is tried if the code returns a true value.
(?!) will always fail to match. Its meaning is something like "don't match nothing", and since 'nothing' can be matched at any point in a string, it always fails.
So when $stop is true the pattern can never match, and when $stop is false it matches.
