"Variable length lookbehind not implemented" but it isn't variable length - regex

I have a very crazy regex that I'm trying to diagnose. It is also very long, but I have cut it down to just the following script. Run using Strawberry Perl v5.26.2.
use strict;
use warnings;
my $text = "M Y H A P P Y T E X T";
my $regex = '(?i)(?<!(Mon|Fri|Sun)day |August )abcd(?-i)';
if ($text =~ m/$regex/){
print "true\n";
}
else {
print "false\n";
}
This gives the error "Variable length lookbehind not implemented in regex."
I am hoping you can help with several issues:
I don't see why this error would occur, because all of the possible lookbehind values are 7 characters: "Monday ", "Friday ", "Sunday ", "August ".
I did not write this regex myself, and I am not sure how to interpret the syntax (?i) and (?-i). When I get rid of the (?i) the error actually goes away. How will perl interpret this part of the regex? I would think the first two characters are evaluated as an "optional literal parenthesis", except that the parenthesis isn't escaped, and in that case I would also get a different syntax error because the closing parenthesis would then be unmatched.
This behavior starts somewhere between Perl 5.16.3_64 and 5.26.1_64, at least in Strawberry Perl. The former version is fine with the code, the latter is not. Why did it start?

I have reduced your problem to this:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!st)A';
print ($text =~ m/$regex/i ? "true\n" : "false\n");
The cause is the /i (case-insensitive) modifier combined with certain character sequences such as "ss" or "st" that can be replaced by a typographic ligature, which makes the lookbehind variable length (for instance, /August/i matches both AUGUST (6 characters) and auguﬆ (5 characters, the last one being U+FB06)).
However if we remove /i (case insensitive) modifier then it works because typographic ligatures are not matched.
Solution: Use aa modifiers i.e.:
/(?<!st)A/iaa
Or in your regex:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!(Mon|Fri|Sun)day |August )abcd';
print ($text =~ m/$regex/iaa ? "true\n" : "false\n");
From perlre:
To forbid ASCII/non-ASCII matches (like "k" with "\N{KELVIN SIGN}"), specify the "a" twice, for example /aai or /aia. (The first occurrence of "a" restricts the \d, etc., and the second occurrence adds the "/i" restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.
See a closely related discussion here
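A minimal sketch of the fix above, checking at run time whether each variant compiles. The helper name try_compile is made up for illustration. The result for plain /i depends on the Perl version (the question reports 5.16 accepting it and 5.26 rejecting it, and releases from 5.30 on accept some variable-length lookbehinds), so only the /iaa behaviour is treated as stable here.

```perl
use strict;
use warnings;

my $pattern = '(?<!(Mon|Fri|Sun)day |August )abcd';

# Hypothetical helper: returns the compiled regex, or undef if compilation dies.
sub try_compile {
    my ($source, $ascii_safe) = @_;
    no warnings;    # silence version-specific "experimental" notices
    return eval { $ascii_safe ? qr/$source/iaa : qr/$source/i };
}

my $plain = try_compile($pattern, 0);
my $safe  = try_compile($pattern, 1);

print "/i   compiles: ", (defined $plain ? "yes" : "no"), "\n";
print "/iaa compiles: ", (defined $safe  ? "yes" : "no"), "\n";

# With /iaa the lookbehind still does its job, case-insensitively.
for my $text ('monday ABCD', 'tuesday ABCD') {
    print "$text => ", ($text =~ $safe ? "true" : "false"), "\n";
}
```

The second loop should report false for the Monday string and true for the Tuesday one, since only the former is preceded by one of the excluded words.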

That's because st can be a ligature. The same happens to fi and ff:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
my $fi = 'ﬁ';  # U+FB01 LATIN SMALL LIGATURE FI (a single character)
print $fi =~ /fi/i;
So imagine something like fi|ﬁ where, indeed, the lengths of the alternatives aren't the same.
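As a small sketch of the same point, the following uses \x{...} escapes instead of literal ligature characters so it does not depend on the file's encoding. It compares plain matching, /i, and /iaa against the one-character ligatures.

```perl
use strict;
use warnings;

my $fi = "\x{FB01}";    # LATIN SMALL LIGATURE FI, a single character
my $st = "\x{FB06}";    # LATIN SMALL LIGATURE ST, a single character

printf "length of U+FB01  : %d\n", length $fi;
print  "U+FB01 =~ /fi/    : ", ($fi =~ /fi/    ? "match" : "no match"), "\n";
print  "U+FB01 =~ /fi/i   : ", ($fi =~ /fi/i   ? "match" : "no match"), "\n";
print  "U+FB01 =~ /fi/iaa : ", ($fi =~ /fi/iaa ? "match" : "no match"), "\n";
print  "U+FB06 =~ /st/i   : ", ($st =~ /st/i   ? "match" : "no match"), "\n";
```

Only the two /i lines match: a two-character pattern matches a one-character string, which is exactly what makes the lookbehind width ambiguous.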

st could be represented in a 1-character stylistic ligature as ﬆ (U+FB06) or ﬅ (U+FB05), so its length could be 2 or 1.
Quickly finding perl's full list of 2→1-character ligatures using a bash command:
$ perl -e 'print $^V'
v5.26.2
$ for lig in {a..z}{a..z}; do \
perl -e 'print if /(?<!'$lig')x/i' 2>/dev/null || echo $lig; done
ff fi fl ss st
These respectively represent the ﬀ, ﬁ, ﬂ, ß, and ﬆ/ﬅ ligatures. (ﬅ represents ſt, using the obsolete long s character; it matches st and it does not match ft.)
Perl also supports the remaining stylistic ligatures, ﬃ and ﬄ for ffi and ffl, though this isn't noteworthy in this context since lookbehinds already have issues with ff and fi/fl separately.
Future releases of perl may include more stylistic ligatures, though all that remain are font-specific (e.g. Linux Libertine has stylistic ligatures for ct and ch) or debatably stylistic (such as the Dutch ĳ for ij or the obsolete Spanish ꝇ for ll). It doesn't seem appropriate to have this treatment for ligatures that are not entirely interchangeable (nobody would accept dœs for does), though there are other scenarios, such as including ß thanks to its uppercase form being SS.
Perl 5.16.3 (and similarly old versions) only stumble on ss (for ß) and fail to expand the other ligatures in lookbehinds (they have fixed width and will not match). I didn't seek out the bugfix to itemize exactly which versions are affected.
Perl 5.14 introduced ligature support, so earlier versions don't have this problem.
Workarounds
Workarounds for /(?<!August)x/i (only the first will also properly avoid Auguﬆ, the spelling that ends in the U+FB06 ligature):
/(?<!Augus[t])(?<!Augu(?=st).)x/i (absolutely comprehensive)
/(?<!Augu(?aa:st))x/i (just the st in the lookbehind is "ASCII-safe" ²)
/(?<!(?aa)August)x/i (the whole lookbehind is "ASCII-safe" ²)
/(?<!August)x/iaa (the whole regex is "ASCII-safe" ²)
/(?<!Augus[t])x/i (breaks ligature seeking ¹)
/(?<!Augus.)x/i (slightly different, matches more)
/(?<!Augu(?-i:st))x/i (case-sensitive st in lookbehind, won't match AugusTx)
These toy with removing the case-insensitive modifier¹ or adding the ASCII-safe modifier² in various places, often requiring the regex writer to specifically know of the variable-width ligature.
The first variation (which is the only comprehensive one) matches the variable widths with two lookbehinds: the first for the six-character version (no ligatures, as noted in the first quote below) and the second for any ligatures, employing a lookahead (which has zero width!) for st (including the ligatures) and then accounting for its single-character width with a "." (dot).
Two segments of the perlre man page:
¹ Case-insensitive modifier /i & ligatures
There are a number of Unicode characters that match a sequence of
multiple characters under /i. For example, "LATIN SMALL LIGATURE
FI" should match the sequence fi. Perl is not currently able to
do this when the multiple characters are in the pattern and are
split between groupings, or when one or more are quantified. Thus
"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i; # Matches [in perl 5.14+]
"\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i; # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i; # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn't match!
² ASCII-safe modifier /aa (perl 5.14+)
To forbid ASCII/non-ASCII matches (like k with \N{KELVIN SIGN}),
specify the a twice, for example /aai or /aia. (The first
occurrence of a restricts the \d, etc., and the second occurrence
adds the /i restrictions.) But, note that code points outside the
ASCII range will use Unicode rules for /i matching, so the modifier
doesn't really restrict things to just ASCII; it just forbids the
intermixing of ASCII and non-ASCII.
To summarize, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice gives
added protection.
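To make the first workaround concrete, here is a small sketch that runs it against a few sample strings. The sample strings and labels are invented for illustration; the ligature is written as \x{FB06} so the script stays ASCII-only.

```perl
use strict;
use warnings;

# First lookbehind: the six-character spelling.
# Second lookbehind: "Augu" followed by exactly one character that /st/i accepts.
my $re = qr/(?<!Augus[t])(?<!Augu(?=st).)x/i;

my @samples = (
    [ 'Marchx',          "Marchx" ],
    [ 'Augustx',         "Augustx" ],
    [ 'AUGUSTx',         "AUGUSTx" ],
    [ 'Augu + U+FB06 x', "Augu\x{FB06}x" ],
);

for my $sample (@samples) {
    my ($label, $text) = @$sample;
    printf "%-16s %s\n", $label, ($text =~ $re ? "matches" : "rejected");
}
```

Only the first sample should match; the other three are rejected by one of the two lookbehinds.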

Put (?i) after lookbehind:
(?<!(Mon|Fri|Sun)day |August )(?i)abcd(?-i)
or
(?<!(Mon|Fri|Sun)day |August )(?i:abcd)
To me it seems to be a bug.
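A short sketch of what this rearrangement does in practice. Note the trade-off: (?i:...) limits case-insensitivity to abcd, so the lookbehind becomes case-sensitive, which is why it compiles but also why it no longer rejects a lowercase "monday ". The test strings are made up for illustration.

```perl
use strict;
use warnings;

# (?i:...) scopes case-insensitivity to "abcd"; the lookbehind stays case-sensitive.
my $re = qr/(?<!(Mon|Fri|Sun)day |August )(?i:abcd)/;

for my $text ('Monday ABCD', 'monday ABCD', 'August abcd', 'Tuesday abcd') {
    printf "%-13s %s\n", $text, ($text =~ $re ? "true" : "false");
}
```

Whether that change in meaning is acceptable depends on the data; if the lookbehind must stay case-insensitive, the /aa approach is the closer equivalent.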


Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                        Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
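The difference between the two assignments can be seen in a short sketch (variable names other than $words and $w_after are made up for illustration):

```perl
use strict;
use warnings;

my $words = "Today i'm not going anywhere except to office.";

# Scalar context: the match returns a success flag, not the capture.
my $flag = ( $words =~ /anywhere\s+(\S+)/ );

# List context: the match returns the list of captured groups.
my ($w_after) = ( $words =~ /anywhere\s+(\S+)/ );

print "scalar context: $flag\n";
print "list context:   $w_after\n";
```

The first line prints 1 and the second prints except.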
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
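A small sketch of this pattern wrapped in a helper. The name word_after and the second sample sentence are made up for illustration, and the script assumes Perl 5.22 or later.

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} is not available before Perl 5.22

# Hypothetical helper: return the word that follows $word in $text.
sub word_after {
    my ($text, $word) = @_;
    my ($next) = $text =~ /\Q$word\E\b{wb}.+?\b{wb}(.+?\b{wb})/;
    return $next;
}

print word_after("Today i'm not going anywhere except to office.", 'anywhere'), "\n";
print word_after("I will go anywhere i'm told", 'anywhere'), "\n";
```

The second call shows the benefit over plain \b: the apostrophe does not split the word, so the whole of i'm is captured.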
First, you have to write parentheses around the left-side expression of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also write parentheses around the =~ binding operator to improve readability, but it is not necessary because =~ has fairly high precedence.
Use the word character class \w:
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note I'm using x so whitespaces in regexp are ignored. Also use \b word boundary to anchor regexp correctly.
[1]: I write my ($w_after) just for convenience because you can write my ($a, $b, $c, @rest) as the equivalent of (my $a, my $b, my $c, my @rest) but you can also control the scope of your variables like (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex should match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
Here [^\s]+ captures the word between the two runs of whitespace.
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);

Why `stoutest` is not a valid regular expression?

From perlop:
If "/" is the delimiter then the initial m is optional. With the m you can use any pair of non-whitespace characters as delimiters. This is particularly useful for matching path names that contain "/", to avoid LTS (leaning toothpick syndrome). If "?" is the delimiter, then the match-only-once rule of ?PATTERN? applies. If "'" is the delimiter, no interpolation is performed on the PATTERN. When using a character valid in an identifier, whitespace is required after the m.
So I can pick up any letter as a delimiter. Eventually this regex should be fine:
stoutest
That can be rewritten
s/ou/es/
However, it does not seem to work in Perl. Why?
$ perl -e '$_ = qw/ou/; stoutest; print'
ou
Because Perl can't pick out the operator s
perldoc perlop says this
Any non-whitespace delimiter may replace the slashes. Add space after the s when using a character allowed in identifiers.
This program works fine
use feature 'say';
my $s = 'bout';
$s =~ s toutest;
say $s;
output
best
Because stoutest, or any other string of alphanumeric characters, is a single token in the eyes of the Perl parser. Otherwise we couldn't use any barewords that begin with s (or m, or q, or y).
This works, though
$_ = "ou";
s toutest;
print
The substitution operator starts with an s identifier, and your code doesn't have one. Gotta use
s toutest
If it worked the way you think, we couldn't have any operators or subroutines that start with m, s, tr, q or y since all of them can be followed by any non-whitespace delimiter.
Ironically, your very own code demonstrates why it can't be the way you think. If it worked the way you think
$_ = qw/ou/; stoutest; print
wouldn't be equivalent to
$_ = qw/ou/; s/ou/es/; print
It would be equivalent to
$_ = q'/ou/; stoutest; print
aka
$_ = '/ou/; stoutest; print

Negative lookahead assertion with the * modifier in Perl

I have the (what I believe to be) negative lookahead assertion <#> *(?!QQQ) that I expect to match if the tested string is a <#> followed by any number of spaces (zero including) and then not followed by QQQ.
Yet, if the tested string is <#> QQQ the regular expression matches.
I fail to see why this is the case and would appreciate any help on this matter.
Here's a test script
use warnings;
use strict;
my @strings = ('something <#> QQQ',
               'something <#> RRR',
               'something <#>QQQ' ,
               'something <#>RRR' );
print "$_\n" for map {$_ . " --> " . rep($_) } (@strings);
sub rep {
my $string = shift;
$string =~ s,<#> *(?!QQQ),at w/o ,;
$string =~ s,<#> *QQQ,at w/ QQQ,;
return $string;
}
This prints
something <#> QQQ --> something at w/o QQQ
something <#> RRR --> something at w/o RRR
something <#>QQQ --> something at w/ QQQ
something <#>RRR --> something at w/o RRR
And I'd have expected the first line to be something <#> QQQ --> something at w/ QQQ.
It matches because zero is included in "any number". So no spaces, followed by a space, matches "any number of spaces not followed by a Q".
You should add another lookahead assertion that the first thing after your spaces is not itself a space. Try this (untested):
<#> *(?!QQQ)(?! )
ETA Side note: changing the quantifier to + would have helped only when there's exactly one space; in the general case, the regex can always grab one less space and therefore succeed. Regexes want to match, and will bend over backwards to do so in any way possible. All other considerations (leftmost, longest, etc) take a back seat - if it can match more than one way, they determine which way is chosen. But matching always wins over not matching.
$string =~ s,<#> *(?!QQQ),at w/o ,;
$string =~ s,<#> *QQQ,at w/ QQQ,;
One problem of yours here is that you are viewing the two regexes separately. You first ask to replace the string without QQQ, and then to replace the string with QQQ. This is actually checking the same thing twice, in a sense. For example: if (X==0) { ... } elsif (X!=0) { ... }. In other words, the code may be better written:
unless ($string =~ s,<#> *QQQ,at w/ QQQ,) {
    $string =~ s,<#> *,at w/o ,;
}
You always have to be careful with the * quantifier. Since it matches zero or more times, it can also match the empty string, which basically means: it can match any place in any string.
A negative look-around assertion has a similar quality, in the sense that it needs to only find a single thing that differs in order to match. In this case, it matches the part "<#> " as <#> + no space + space, where space is of course "not" QQQ. You are more or less at a logical impasse here, because the * quantifier and the negative look-ahead counter each other.
I believe the correct way to solve this is to separate the regexes, like I showed above. There is no sense in allowing the possibility of both regexes being executed.
However, for theoretical purposes, a working regex that allows both any number of spaces, and a negative look-ahead would need to be anchored. Much like Mark Reed has shown. This one might be the simplest.
<#>(?! *QQQ) # Add the spaces to the look-ahead
The difference is that now the spaces and Qs are anchored to each other, whereas before they could match separately. To drive home the point of the * quantifier, and also solve a minor problem of removing additional spaces, you can use:
<#> *(?! *QQQ)
This will work because either of the quantifiers can match the empty string. Theoretically, you can add as many of these as you want, and it will make no difference (except in performance): / * * * * * * */ is functionally equivalent to / */. The difference here is that spaces combined with Qs may not exist.
The regex engine will backtrack until it finds a match, or until finding a match is impossible. In this case, it found the following match:
                         +--------------- Matches "<#>".
                         |   +----------- Matches "" (empty string).
                         |   |       +--- Doesn't match " QQQ".
                         |   |       |
                        --- ---- -------
'something <#> QQQ' =~ /<#> [ ]* (?!QQQ)/x
All you need to do is shuffle things around. Replace
/<#>[ ]*(?!QQQ)/
with
/<#>(?![ ]*QQQ)/
Or you can make it so the regex will only match all the spaces:
/<#>[ ]*+(?!QQQ)/
/<#>[ ]*(?![ ]|QQQ)/
/<#>[ ]*(?![ ])(?!QQQ)/
PS — Spaces are hard to see, so I use [ ] to make them more visible. It gets optimised away anyway.
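As a quick check of the variants discussed above, this sketch counts how many of the four test strings from the question each pattern matches. The helper name hits and the labels are made up for illustration.

```perl
use strict;
use warnings;

my @strings = (
    'something <#> QQQ',
    'something <#> RRR',
    'something <#>QQQ',
    'something <#>RRR',
);

my @patterns = (
    [ 'original',                qr/<#> *(?!QQQ)/   ],
    [ 'spaces inside lookahead', qr/<#>(?! *QQQ)/   ],
    [ 'possessive quantifier',   qr/<#> *+(?!QQQ)/  ],
    [ 'reject a further space',  qr/<#> *(?! |QQQ)/ ],
);

# Hypothetical helper: number of test strings the pattern matches.
sub hits {
    my ($re) = @_;
    return scalar grep { $_ =~ $re } @strings;
}

for my $p (@patterns) {
    my ($name, $re) = @$p;
    printf "%-24s matches %d of %d\n", $name, hits($re), scalar @strings;
}
```

The original pattern matches three strings (it wrongly accepts the one with a space before QQQ); each of the other three matches only the two RRR strings.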

How can I find repeated letters with a Perl regex?

I am looking for a regex that will find repeating letters. So any letter twice or more, for example:
booooooot or abbott
I won't know the letter I am looking for ahead of time.
This is a question I was asked in interviews and have since asked in interviews myself. Not many people get it right.
You can find any letter, then use \1 to find that same letter a second time (or more). If you only need to know the letter, then $1 will contain it. Otherwise you can concatenate the second match onto the first.
my $str = "Foooooobar";
$str =~ /(\w)(\1+)/;
print $1;
# prints 'o'
print $1 . $2;
# prints 'oooooo'
I think you actually want this rather than the "\w" as that includes numbers and the underscore.
([a-zA-Z])\1+
Ok, ok, I can take a hint Leon. Use this for the unicode-world or for posix stuff.
([[:alpha:]])\1+
I think using a backreference would work:
(\w)\1+
\w is basically [a-zA-Z_0-9] so if you only want to match letters between A and Z (case insensitively), use [a-zA-Z] instead.
(EDIT: or, like Tanktalus mentioned in his comment (and as others have answered as well), [[:alpha:]], which is locale-sensitive)
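Putting the backreference and [[:alpha:]] together, a minimal sketch that collects every run of repeated letters in a string could look like this (the name repeated_runs and the third test word are made up for illustration):

```perl
use strict;
use warnings;

# Return every run of two or more identical letters (digits and "_" are ignored).
sub repeated_runs {
    my ($str) = @_;
    my @runs;
    # Group 1 is the whole run, group 2 the single letter being repeated.
    while ( $str =~ /(([[:alpha:]])\2+)/g ) {
        push @runs, $1;
    }
    return @runs;
}

for my $word (qw(abbott booooooot a11b__c)) {
    printf "%-10s => %s\n", $word, join(', ', repeated_runs($word));
}
```

The match is case-sensitive as written, so "Aa" does not count as a repeat; adding /i would change that.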
Use \N (where N is the group number, e.g. \1) to refer to previous groups:
/(\w)\1+/g
You might want to take care as to what is considered to be a letter, and this depends on your locale. Using ISO Latin-1 will allow accented Western language characters to be matched as letters. In the following program, the default locale doesn't recognise é, and thus créé fails to match. Uncomment the locale setting code, and then it begins to match.
Also note that \w includes digits and the underscore character along with all the letters. To get just the letters, you need to take the complement of the non-alphanum, digits and underscore characters. This leaves only letters.
That might be easier to understand by framing it as the question:
"What regular expression matches any digit except 3?"
The answer is:
/[^\D3]/
#! /usr/local/bin/perl
use strict;
use warnings;
# uncomment the following three lines:
# use locale;
# use POSIX;
# setlocale(LC_CTYPE, 'fr_FR.ISO8859-1');
while (<DATA>) {
chomp;
if (/([^\W_0-9])\1+/) {
print "$_: dup [$1]\n";
}
else {
print "$_: nope\n";
}
}
__DATA__
100
food
créé
a::b
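The double-negative character class trick used above can be tried in isolation. This is only a sketch; the variable names and the test characters are made up for illustration.

```perl
use strict;
use warnings;

my $digit_not_3 = qr/^[^\D3]$/;       # not a non-digit, and not 3
my $letter_only = qr/^[^\W_0-9]$/;    # not a non-word character, not "_", not a digit

for my $ch (qw(2 3 a _)) {
    printf "%s: digit-except-3=%s letter=%s\n",
        $ch,
        ($ch =~ $digit_not_3 ? "yes" : "no"),
        ($ch =~ $letter_only ? "yes" : "no");
}
```

Only 2 passes the first class and only a passes the second.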
The following code will return all the characters, that repeat two or more times:
my $str = "SSSannnkaaarsss";
print $str =~ /(\w)\1+/g;
Just for kicks, a completely different approach:
if ( ($str ^ substr($str,1) ) =~ /\0+/ ) {
print "found ", substr($str, $-[0], $+[0]-$-[0]+1), " at offset ", $-[0];
}
FYI, aside from RegExBuddy, a real handy free site for testing regular expressions is RegExr at gskinner.com. Handles ([[:alpha:]])(\1+) nicely.
How about:
(\w)\1+
The first part makes an unnamed group around a character, then the back-reference looks for that same character.
I think this should also work:
((\w)(?=\2))+\2
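/(.)\1{2,}+/u
The /u modifier makes the pattern match with Unicode rules. Note that {2,} requires the character to appear at least three times in a row; use \1+ to also catch doubled letters.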