Why is `stoutest` not a valid regular expression?

From perlop:
If "/" is the delimiter then the initial m is optional. With the m you can use any pair of non-whitespace characters as delimiters. This is particularly useful for matching path names that contain "/", to avoid LTS (leaning toothpick syndrome). If "?" is the delimiter, then the match-only-once rule of ?PATTERN? applies. If "'" is the delimiter, no interpolation is performed on the PATTERN. When using a character valid in an identifier, whitespace is required after the m.
So I can pick any letter as a delimiter. It seems, then, that this regex should be fine:
stoutest
That can be rewritten
s/ou/es/
However, it does not seem to work in Perl. Why?
$ perl -e '$_ = qw/ou/; stoutest; print'
ou

Because Perl can't pick out the operator s
perldoc perlop says this
Any non-whitespace delimiter may replace the slashes. Add space after the s when using a character allowed in identifiers.
This program works fine
my $s = 'bout';
$s =~ s toutest;
say $s;
output
best

Because stoutest, or any other string of alphanumeric characters, is a single token in the eyes of the Perl parser. Otherwise we couldn't use any barewords that begin with s (or m, or q, or y).
This works, though
$_ = "ou";
s toutest;
print

The substitution operator starts with an s identifier, and your code doesn't have one. You have to use
s toutest
If it worked the way you think, we couldn't have any operators or subroutines that start with m, s, tr, q or y since all of them can be followed by any non-whitespace delimiter.
Ironically, your very own code demonstrates why it can't be the way you think. If it worked the way you think,
$_ = qw/ou/; stoutest; print
wouldn't be equivalent to
$_ = qw/ou/; s/ou/es/; print
It would be equivalent to
$_ = q'/ou/; stoutest; print
aka
$_ = '/ou/; stoutest; print
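For comparison, here is a minimal sketch of three spellings of the same substitution. The variable names are made up for illustration; the point is only that the letter-delimited form needs whitespace after the s so the parser can see the operator.

```perl
use strict;
use warnings;

# Three spellings of the same substitution, s/ou/es/.
my ($slash, $brace, $letter) = ('bout') x 3;

$slash  =~ s/ou/es/;     # conventional slash delimiters
$brace  =~ s{ou}{es};    # bracketing delimiters
$letter =~ s toutest;    # "t" as the delimiter; the space after s is required

print "$slash $brace $letter\n";
```

Without the space, stoutest is read as one identifier, so no substitution takes place.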

Related

How to use the literal string "STDIN" in negative lookbehind in perl? [duplicate]

I have a very crazy regex that I'm trying to diagnose. It is also very long, but I have cut it down to just the following script. Run using Strawberry Perl v5.26.2.
use strict;
use warnings;
my $text = "M Y H A P P Y T E X T";
my $regex = '(?i)(?<!(Mon|Fri|Sun)day |August )abcd(?-i)';
if ($text =~ m/$regex/){
print "true\n";
}
else {
print "false\n";
}
This gives the error "Variable length lookbehind not implemented in regex."
I am hoping you can help with several issues:
I don't see why this error would occur, because all of the possible lookbehind values are 7 characters: "Monday ", "Friday ", "Sunday ", "August ".
I did not write this regex myself, and I am not sure how to interpret the syntax (?i) and (?-i). When I get rid of the (?i) the error actually goes away. How will perl interpret this part of the regex? I would think the first two characters are evaluated as "optional literal parenthesis", except that the parenthesis isn't escaped, and in that case I would also get a different syntax error because the closing parenthesis would then not be matched.
This behavior starts somewhere between Perl 5.16.3_64 and 5.26.1_64, at least in Strawberry Perl. The former version is fine with the code, the latter is not. Why did it start?
I have reduced your problem to this:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!st)A';
print ($text =~ m/$regex/i ? "true\n" : "false\n");
The error is caused by the /i (case-insensitive) modifier combined with certain character sequences such as "ss" or "st" that can also be matched by a single typographic ligature character. That makes the lookbehind variable length: /August/i, for instance, matches both AUGUST (6 characters) and auguﬆ (5 characters, the last one being U+FB06).
However if we remove /i (case insensitive) modifier then it works because typographic ligatures are not matched.
Solution: Use aa modifiers i.e.:
/(?<!st)A/iaa
Or in your regex:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!(Mon|Fri|Sun)day |August )abcd';
print ($text =~ m/$regex/iaa ? "true\n" : "false\n");
From perlre:
To forbid ASCII/non-ASCII matches (like "k" with "\N{KELVIN SIGN}"), specify the "a" twice, for example /aai or /aia. (The first occurrence of "a" restricts the \d, etc., and the second occurrence adds the "/i" restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.
See a closely related discussion here
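To see the effect of /aa in isolation, here is a small sketch (not from the original answer) that matches the single-character ligature U+FB06 against the two-character pattern st, with and without the doubled a modifier. It assumes Perl 5.14 or later, where /aa is available.

```perl
use strict;
use warnings;

# U+FB06 LATIN SMALL LIGATURE ST is one character that case-folds to "st".
my $lig = "\x{FB06}";

# Under plain /i, Unicode folding lets the ligature match the pattern "st".
my $matches_i   = $lig =~ /st/i   ? 1 : 0;

# Under /iaa, ASCII pattern characters may not match non-ASCII characters.
my $matches_iaa = $lig =~ /st/iaa ? 1 : 0;

printf "length=%d  /i=%d  /iaa=%d\n", length($lig), $matches_i, $matches_iaa;
```

Because st can only match exactly two characters under /aa, a lookbehind containing it has a fixed width again.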
That's because st can be a ligature. The same happens to fi and ff:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
my $fi = 'ﬁ';    # a single character, U+FB01 LATIN SMALL LIGATURE FI
print $fi =~ /fi/i;
So imagine something like fi|ﬁ where, indeed, the lengths of the alternatives aren't the same.
st could be represented in a 1-character stylistic ligature as ﬆ or ﬅ, so its length could be 2 or 1.
Quickly finding perl's full list of 2→1-character ligatures using a bash command:
$ perl -e 'print $^V'
v5.26.2
$ for lig in {a..z}{a..z}; do \
perl -e 'print if /(?<!'$lig')x/i' 2>/dev/null || echo $lig; done
ff fi fl ss st
These respectively represent the ﬀ, ﬁ, ﬂ, ß, and ﬆ/ﬅ ligatures. (ﬅ represents ſt, using the obsolete long s character; it matches st and it does not match ft.)
Perl also supports the remaining stylistic ligatures, ﬃ and ﬄ for ffi and ffl, though this isn't noteworthy in this context since lookbehinds already have issues with ff and fi/fl separately.
Future releases of perl may include more stylistic ligatures, though all that remain are font-specific (e.g. Linux Libertine has stylistic ligatures for ct and ch) or debatably stylistic (such as the Dutch ĳ for ij or the obsolete Spanish ꝇ for ll). It doesn't seem appropriate to have this treatment for ligatures that are not entirely interchangeable (nobody would accept dœs for does), though there are other scenarios, such as including ß thanks to its uppercase form being SS.
Perl 5.16.3 (and similarly old versions) only stumble on ss (for ß) and fail to expand the other ligatures in lookbehinds (they have fixed width and will not match). I didn't seek out the bugfix to itemize exactly which versions are affected.
Perl 5.14 introduced ligature support, so earlier versions don't have this problem.
Workarounds
Workarounds for /(?<!August)x/i (only the first will properly avoid Auguﬆ):
/(?<!Augus[t])(?<!Augu(?=st).)x/i (absolutely comprehensive)
/(?<!Augu(?aa:st))x/i (just the st in the lookbehind is "ASCII-safe" ²)
/(?<!(?aa)August)x/i (the whole lookbehind is "ASCII-safe" ²)
/(?<!August)x/iaa (the whole regex is "ASCII-safe" ²)
/(?<!Augus[t])x/i (breaks ligature seeking ¹)
/(?<!Augus.)x/i (slightly different, matches more)
/(?<!Augu(?-i:st))x/i (case-sensitive st in lookbehind, won't match AugusTx)
These toy with removing the case-insensitive modifier¹ or adding the ASCII-safe modifier² in various places, often requiring the regex writer to specifically know of the variable-width ligature.
The first variation (which is the only comprehensive one) matches the variable widths with two lookbehinds: first for the six character version (no ligatures as noted in the first quote below) and second for any ligatures, employing a forward lookahead (which has zero width!) for st (including the ligatures) and then accounting for its single character width with a .
Two segments of the perlre man page:
¹ Case-insensitive modifier /i & ligatures
There are a number of Unicode characters that match a sequence of
multiple characters under /i. For example, "LATIN SMALL LIGATURE
FI" should match the sequence fi. Perl is not currently able to
do this when the multiple characters are in the pattern and are
split between groupings, or when one or more are quantified. Thus
"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i; # Matches [in perl 5.14+]
"\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i; # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i; # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn't match!
² ASCII-safe modifier /aa (perl 5.14+)
To forbid ASCII/non-ASCII matches (like k with \N{KELVIN SIGN}),
specify the a twice, for example /aai or /aia. (The first
occurrence of a restricts the \d, etc., and the second occurrence
adds the /i restrictions.) But, note that code points outside the
ASCII range will use Unicode rules for /i matching, so the modifier
doesn't really restrict things to just ASCII; it just forbids the
intermixing of ASCII and non-ASCII.
To summarize, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice gives
added protection.
Put (?i) after lookbehind:
(?<!(Mon|Fri|Sun)day |August )(?i)abcd(?-i)
or
(?<!(Mon|Fri|Sun)day |August )(?i:abcd)
To me it seems to be a bug.
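On the question of how (?i) and (?-i) are interpreted: they are inline modifiers, not literal parentheses. The sketch below (with made-up test strings) compares the switch-on/switch-off form with the scoped group form (?i:...); in both, only abcd is matched case-insensitively while the surrounding x stays case-sensitive.

```perl
use strict;
use warnings;

my $inline = qr/x(?i)abcd(?-i)x/;   # turn /i on, then off again
my $scoped = qr/x(?i:abcd)x/;       # /i applies only inside the group

for my $text ('xABCDx', 'xabcdx', 'XabcdX') {
    printf "%-6s inline=%d scoped=%d\n", $text,
        ($text =~ $inline ? 1 : 0), ($text =~ $scoped ? 1 : 0);
}
```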

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                        Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
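A short sketch of the difference Note 1 describes, using the sentence from the question (the name $flag is only illustrative): in scalar context the match returns a success flag, in list context it returns the captured text.

```perl
use strict;
use warnings;

my $words = "Today i'm not going anywhere except to office.";

# Scalar assignment: the match only reports success or failure.
my $flag      = ( $words =~ /anywhere\s+(\S+)/ );

# List assignment: the match returns the captured groups.
my ($w_after) = ( $words =~ /anywhere\s+(\S+)/ );

print "scalar: $flag\n";
print "list:   $w_after\n";
```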
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to put parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also put parentheses around the =~ binding expression to improve readability, but it is not necessary because =~ has fairly high precedence.
Use the \w (word) character class:
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note that I'm using the /x modifier so whitespace in the regexp is ignored. Also use the \b word boundary to anchor the regexp correctly.
[1]: I write my ($w_after) just for convenience, because you can write my ($a, $b, $c, @rest) as the equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables, as in (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex should match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
[^\s]+ matches the word between two runs of whitespace
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);

Regular Expression to find $0.00

Need to count the number of "$0.00" in a string. I'm using:
my $zeroDollarCount = ("\Q$menu\E" =~ tr/\$0\.00//);
but it doesn't work. The issue is the $ sign is throwing the regex off. It works if I just want to count the number of $, but fails to find $0.00.
How is this a duplicate? Your solution does not address dollar sign which is an issue for me.
You are using the transliteration operator tr///. That doesn't have anything to do with a pattern. You need the match operator m// instead. And because you want it to find all occurrences of the pattern, use the /g modifier.
my $count = () = $menu =~ m/\$0\.00/g;
If we run this program, the output is 2.
use strict;
use warnings;
my $menu = '$0.00 and $0.00';
my $count = () = $menu =~ m/\$0\.00/g;
print $count;
Now let's take a look at what is going on. First, the pattern of the match.
/\$0\.00/
This is fairly straight-forward. There is a literal $, which we need to escape with a backslash \. The zero is followed by a literal dot ., which again we need to escape, because like the $ it has special meanings in regular expressions.
my $count = () = $menu =~ m/\$0\.00/g;
This whole line looks weird. We can break it up into a few lines to make it more readable.
my @matches = ( $menu =~ m/\$0\.00/g );
my $count = scalar @matches;
We need the /g switch on the regular expression match to make it match all occurrences. In list context, the match operation returns all matches (which will be the string "$0.00" a number of times). Because we want the count, we then force that into scalar context, which gives us the number of elements. That can be shortened to one line by the idiom shown above.
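To make the difference from tr/// concrete, here is a sketch on the same sample string (the variable names are illustrative). tr/// treats its argument as a set of individual characters and counts each one, whereas the /g match counts whole occurrences of the pattern.

```perl
use strict;
use warnings;

my $menu = '$0.00 and $0.00';

# tr/// counts every character that is in the set { $, 0, . }.
my $char_count  = ( $menu =~ tr/$0.// );

# m//g in list context returns every match; "() =" turns that into a count.
my $match_count = () = $menu =~ /\$0\.00/g;

print "tr counted $char_count characters\n";
print "m//g counted $match_count matches\n";
```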

Perl regex - can I say 'if character/string matches, delete it and all to right of it'?

I have an array of strings, some of which contain the character '-'. I want to be able to search for it and for those strings that contain it I wish to delete all characters to the right of it.
So for example if I have:
$string1 = 'home - London';
$string2 = 'office';
$string3 = 'friend-Manchester';
or something as such, then the affected strings would become:
$string1 = 'home';
$string3 = 'friend';
I don't know if the white-space before the '-' would be included in the string afterwards (I don't want it as I will be comparing strings at a later point, although if it doesn't affect string comparisons then it doesn't matter).
I do know that I can search and replace specific strings/characters using something like:
$string1 =~ s/-//
or
$string1 =~ tr/-//
but I'm not very familiar with regular expressions in Perl so I'm not 100% sure of these. I've looked around and couldn't see anything to do with 'to the right of' in regex. Help appreciated!
You can delete anything after a hyphen - with this substitution:
s/-.*$//s
However, you will want to remove the whitespace prior to the hyphen and thus do
s/\s* - .* $//xs
The $ anchors the regex at the end of the string and the /s flag allows the dot to match newlines as well. While the $ is superfluous, it might add clarity.
Your substitution would just have removed the first -, and your transliteration would have removed all hyphens from the string.
Your regular expressions are just searching for the dash, so that's all they replace. You want to search for the dash, and anything after it.
$string =~ s/-.*//;
. represents any character, * means search for that character 0 or more times, and match as many as possible (i.e. to the end of the string if possible)
You can also search for an optional space before it.
$string =~ s/\s?-.*//;
(\s matches any whitespace character and is easier to read than a literal space)
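Putting the pieces together, here is a minimal sketch applied to the three strings from the question. It uses \s* rather than \s? so that any amount of whitespace before the hyphen is removed; that choice is an assumption about what you want.

```perl
use strict;
use warnings;

my @strings = ('home - London', 'office', 'friend-Manchester');

# Remove optional whitespace, the first hyphen, and everything after it.
# Strings without a hyphen are left untouched.
s/\s*-.*//s for @strings;

print join('|', @strings), "\n";
```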
Using plain substr() and index() is possible as well.
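my @strings = ("we are - so cool",
               "lonely",
               "friend-Manchester",
               "home - london",
               "home-new york",
               "home with-childeren-first episode");
local $/ = " ";    # make chomp strip a trailing space instead of a newline
foreach (@strings) {
    # keep only the part before the first hyphen, if there is one
    $_ = substr($_, 0, index($_, '-')) if index($_, '-') != -1;
    chomp;
}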
The other answers are good. However, in light of what you said:
...if it doesn't affect string comparisons then it doesn't matter
You don't need a separate step for this at all. Suppose you want to compare $string with another variable, $search_string. The following expression will check for an exact match, except that it ignores anything $string has after a dash:
if ($string =~ /^\Q$search_string\E(\s*-|$)/) { print "Strings matched"; }
#Using Regex:
my @strings =
    ("we are - so cool",
     "lonely",
     "friend - Manchester",
     "home - london",
     "home - new york",
     "home with-childeren-first episode"
    );
foreach (@strings) {
    $_ =~ s/-\s*[a-zA-Z ]+\s*//g;
    print "NEW: ".$_."\n";
}

How can I convert a string into a regular expression that matches itself in Perl?

How can I convert a string to a regular expression that matches itself in Perl?
I have a set of strings like these:
Enter your selection:
Enter Code (Navigate, Abandon, Copy, Exit, ?):
and I want to convert them to regular expressions so I can match something else against them. In most cases the string is the same as the regular expression, but not in the second example above because the ( and ? have meaning in regular expressions. So that second string needs to become an expression like:
Enter Code \(Navigate, Abandon, Copy, Exit, \?\):
I don't need the matching to be too strict, so something like this would be fine:
Enter Code .Navigate, Abandon, Copy, Exit, ..:
My current thinking is that I could use something like:
s/[\?\(\)]/./g;
but I don't really know what characters will be in the list of strings and if I miss a special char then I might never notice the program is not behaving as expected. And I feel that there should exist a general solution.
Thanks.
As Brad Gilbert commented use quotemeta:
my $regex = qr/^\Q$string\E$/;
or
my $quoted = quotemeta $string;
my $regex2 = qr/^$quoted$/;
There is a function for that: quotemeta.
quotemeta EXPR
Returns the value of EXPR
with all non-"word" characters
backslashed. (That is, all characters
not matching /[A-Za-z_0-9]/ will be
preceded by a backslash in the
returned string, regardless of any
locale settings.) This is the internal
function implementing the \Q escape in
double-quoted strings.
If EXPR is omitted, uses $_.
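As a quick check, here is a sketch using the second prompt string from the question. It compares an unquoted pattern, where ( and ? keep their special meaning, with the \Q...\E version; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $string = 'Enter Code (Navigate, Abandon, Copy, Exit, ?):';

my $naive  = qr/^$string$/;        # ( ) and ? act as regex metacharacters
my $quoted = qr/^\Q$string\E$/;    # every non-word character is escaped

printf "naive=%d quoted=%d\n",
    ($string =~ $naive  ? 1 : 0),
    ($string =~ $quoted ? 1 : 0);
```

The unquoted pattern fails against its own source string because the parentheses are treated as a group rather than as literal characters.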
From http://www.regular-expressions.info/characters.html :
there are 11 characters with special meanings: the opening square bracket [, the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening round bracket ( and the closing round bracket )
In Perl (and PHP) there is a special function quotemeta that will escape all these for you.
To put Brad Gilbert's suggestion into an answer instead of a comment, you can use quotemeta function. All credit to him
Why use a regular expression at all? Since you aren't doing any capturing and it seems you will not be going to allow for any variations, why not simply use the index builtin?
$s1 = 'hello, (world)?!';
$s2 = 'he said "hello, (world)?!" and nothing else.';
if ( -1 != index $s2, $s1 ) {
print "we've got a match\n";
}
else {
print "sorry, no match.\n";
}