Why does this regular expression suppress more than I was expecting? - regex

I am trying to suppress strings that begin with [T without doing a positive match and negating the results.
my #tests = ("OT", "[T","NOT EXCLUDED");
foreach my $test (#tests)
{
#match from start of string,
#include 'Not left sq bracket' then include 'Not capital T'
if ($test =~ /^[^\[][^T]/) #equivalent to /^[^\x5B][^T]/
{
print $test,"\n";
}
}
Outputs
NOT EXCLUDED
My question is, can somebody tell me why OT is being excluded in the above example?
EDIT
Thanks for your replies so far everybody, I can see I was being a bit stoopid.

The regex ^[^\[][^T] matches string that begin with a character other than [ followed by a character other than T.
Since OT has T as 2nd character, it is not matched.
If you want to match any string other than those that begin with [T, you can do:
if ($test =~ /^(?!\[T)/) {
print $test,"\n";
}

YAPE::Regex::Explain can be helpful:
$ perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(qr/^[^\[][^T]/)->explain'
The regular expression:
(?-imsx:^[^\[][^T])
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
[^\[] any character except: '\['
----------------------------------------------------------------------
[^T] any character except: 'T'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

Your regex translates to:
From start of the input, match anything but an open square bracket ([) followed by anything but a capital T
OT fails to match
[T as well

Your expression is equivalent to "begins with NOT [ and second one is NOT T", so the only one that passes is NOT EXCLUDED, because in OT, the second letter is T

Related

Regex for matching single digit preceded by char [duplicate]

I want to extract a substring from a line in Perl. Let me explain giving an example:
fhjgfghjk3456mm 735373653736
icasd 666666666666
111111111111
In the above lines, I only want to extract the 12 digit number. I tried using split function:
my #cc = split(/[0-9]{12}/,$line);
print #cc;
But what it does is removes the matched part of the string and stores the residue in #cc. I want the part matching the pattern to be printed. How do I that?
You can do it with regular expressions:
#!/usr/bin/perl
my $string = 'fhjgfghjk3456mm 735373653736 icasd 666666666666 111111111111';
while ($string =~ m/\b(\d{12})\b/g) {
say $1;
}
Test the regex here: http://rubular.com/r/Puupx0zR9w
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/\b(\d+)\b/)->explain();
The regular expression:
(?-imsx:\b(\d+)\b)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The $1 built-in variable stores the last match from a regex. Also, if you perform a regex on a whole string, it will return the whole string. The best solution here is to put parentheses around your match then print $1.
my $strn = "fhjgfghjk3456mm 735373653736\nicasd\n666666666666 111111111111";
$strn =~ m/([0-9]{12})/;
print $1;
This makes our regex match JUST the twelve digit number and then we return that match with $1.
#!/bin/perl
my $var = 'fhjgfghjk3456mm 735373653736 icasd 666666666666 111111111111';
if($var =~ m/(\d{12})/) {
print "Twelve digits: $1.";
}
#!/usr/bin/env perl
undef $/;
$text = <DATA>;
#res = $text =~ /\b\d{12}\b/g;
print "#res\n";
__DATA__
fhjgfghjk3456mm 735373653736
icasd 666666666666
111111111111

What is happening inside this regex alteration expression

The following regular expresion works but can anyone explain how?
Any comment is appreciated! Thanks! Quinoa
What is the regex "|" doing to strip the tags "" and "" from <script>Keep THIS</Script> to get "Keep THIS" into memory $1?
Here is the REGEX:
(?x)
([\w\.!?,\s-])|<.*?>|.
Here is the string:
<script>Keep THIS</Script>
Results: $1 = "Keep THIS"
Commented below:
(?x) set flags for this block (disregarding
whitespace and comments) (case-sensitive)
(with ^ and $ matching normally) (with .
not matching \n)
( group and capture to \1:
[\w\.!?,\s-] any character of: word characters (a-z,
A-Z, 0-9, _), '\.', '!', '?', ',',
whitespace (\n, \r, \t, \f, and " "), '-
'
) end of \1
| OR
< '<'
.? any character except \n (optional
(matching the most amount possible))
> '>'
| OR
. any character except \n
<.*?> matches all the tags , that is it matches all the strings which starts with < and endswith >. Then from the remaining string this ([\w\.!?,\s-]) regex would capture all the word character or dot or ! or ? or space or comma or hyphen. Note that it would capture each single character into group 1.
If you want to capture the whole string Keep THIS into group 1 then you need to add + quantifier next to the character class. + repeats the previous token one or more times.
([\w\.!?,\s-]+)|<.*?>|.
Finally the . matches all the remaining characters which are not matched.
DEMO
The only way this does what you say is if you are using a global match in a loop, and don't have use warnings in place as you should.
Here's what I think you have, but using Data::Dump to display the contents of $1 instead of what is presumably print $1 in your own code. (It really helps a lot to show your actual Perl code instead of selected snippets.)
use strict;
use warnings;
use Data::Dump;
my $s = '<script>Keep THIS</Script>';
my $re = qr/(?x)
([\w\.!?,\s-])|<.*?>|./;
while ( $s =~ /$re/g ) {
dd $1;
}
output
undef
"K"
"e"
"e"
"p"
" "
"T"
"H"
"I"
"S"
undef
The first pass is matching <script>, which isn't captured so $1 is undefined.
Subsequent passes match a single character from the class [\w\.!?,\s-], which consumes the string Keep THIS one character at a time.
Finally, the closing </Script> is matched without capturing, and leaves $1 undefined again.
undef is printed as a null string, and without warnings enabled you won't be alerted to it.
The solution is to always use a poper HTML parser to process HTML. Regular expressions are the wrong tool for the job.

cryptic perl expression

I find the following statement in a perl (actually PDL) program:
/\/([\w]+)$/i;
Can someone decode this for me, an apprentice in perl programming?
Sure, I'll explain it from the inside out:
\w - matches a single character that can be used in a word (alphanumeric, plus '_')
[...] - matches a single character from within the brackets
[\w] - matches a single character that can be used in a word (kinda redundant here)
+ - matches the previous character, repeating as many times as possible, but must appear at least once.
[\w]+ - matches a group of word characters, many times over. This will find a word.
(...) - grouping. remember this set of characters for later.
([\w]+) - match a word, and remember it for later
$ - end-of-line. match something at the end of a line
([\w]+)$ - match the last word on a line, and remember it for later
\/ - a single slash character '/'. it must be escaped by backslash, because slash is special.
\/([\w]+)$ - match the last word on a line, after a slash '/', and remember the word for later. This is probably grabbing the directory/file name from a path.
/.../ - match syntax
/.../i - i means case-insensitive.
All together now:
/\/([\w]+)$/i; - match the last word on a line and remember it for later; the word must come after a slash. Basically, grab the filename from an absolute path. The case insensitive part is irrelevant, \w will already match both cases.
More details about Perl regex here: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
And as JRFerguson pointed out, YAPE::Regex::Explain is useful for tokenizing regex, and explaining the pieces.
You will find the Yape::Regex::Explain module worth installing.
#!/usr/bin/env perl
use YAPE::Regex::Explain;
#...may need to single quote $ARGV[0] for the shell...
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
Assuming this script is named 'rexplain' do:
$ ./rexplain '/\/([\w]+)$/i'
...to obtain:
The regular expression:
(?-imsx:/\/([\w]+)$/i)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\w]+ any character of: word characters (a-z,
A-Z, 0-9, _) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\w]+ any character of: word characters (a-z,
A-Z, 0-9, _) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
/i '/i'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
UPDATE:
See also: https://stackoverflow.com/a/12359682/1015385 . As noted there and in the module's documentation:
There is no support for regular expression syntax added after Perl version 5.6, particularly any
constructs added in 5.10.
/\/([\w]+)$/i;
It is a regex, and if it is a complete statement, it is applied to the $_ variable, like so:
$_ =~ /\/([\w]+)$/i;
It looks for a slash \/, followed by an alphanumeric string \w+, followed by end of line $. It also captures () the alphanumeric string, which ends up in the variable $1. The /i on the end makes it case-insensitive, which has no effect in this case.
While it doesn't help "explain" a regex, once you have a test case, Damian's new Regexp::Debugger is a cool utility to watch what actually occurs during the matching. Install it and then do rxrx at the command line to start the debugger, then type in /\/([\w]+)$/ and '/r' (for example), and finally m to start the matching. You can then step through the debugger by hitting enter repeatedly. Really cool!
This is comparing $_ to a slash followed by one or more character (case insensitive) and storing it in $1
$_ value then $1 value
------------------------------
"/abcdes" | "abcdes"
"foo/bar2" | "bar2"
"foobar" | undef # no slash so doesn't match
The Online Regex Analyzer deserves a mention. Here's a link to explain what your regex means, and pasted here for the record.
Sequence: match all of the followings in order
/ (slash)
--+
Repeat | (in GroupNumber:1)
AnyCharIn[ WordCharacter] one or more times |
--+
EndOfLine

Why does "Year 2010" =~ /([0-4]*)/ results in empty string in $1?

If I run
"Year 2010" =~ /([0-4]*)/;
print $1;
I get empty string.
But
"Year 2010" =~ /([0-4]+)/;
print $1;
outputs "2010". Why?
You get an empty match right at the start of the string "Year 2010" for the first form because the * will immediately match 0 digits. The + form will have to wait until it sees at least one digit before it matches.
Presumably if you can go through all the matches of the first form, you'll eventually find 2010... but probably only after it finds another empty match before the 'e', then before the 'a' etc.
The first regular expression successfully matches zero digits at the start of the string, which results in capturing the empty string.
The second regular expression fails to match at the start of the string, but it does match when it reaches 2010.
The first matches the zero-length string at the beginning (before Y) and returns it. The second searches for one-or-more digits and waits until it finds 2010.
you can also use YAPE::Regex::Explain for explanation of a regular expression like
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new('([0-4]*)')->explain();
print YAPE::Regex::Explain->new('([0-4]+)')->explain();
output:
The regular expression:
(?-imsx:([0-4]*))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-4]* any character of: '0' to '4' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The regular expression:
(?-imsx:([0-4]+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-4]+ any character of: '0' to '4' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The star symbol tries to basically match 0 or more symbols in given set (in theory, the set {x,y}* consists of empty string and all possible finite sequences made of x and y), and therefore, it will match exactly zero characters (empty string) at the beginning of the string, zero characters after first character, zero characters after the second character, etc. Then finally it will find 2 and match whole 2010.
The plus symbol matches one or more characters from the given set ({x,y}+ consists of all possible finite sequences made of x and y, without the empty string, as opposed to {x,y}*). So the first met matching character is 2, then next - 0 is checked, then 1, then another 0, and then the sentence ends, so found group looks like '2010'.
It is standard behavior for regular expressions, defined in formal language theory. I strongly suggest to learn a bit theory about regular expressions, it can't hurt, but can help :)
We have this as a trick question in Learning Perl. Any regex that can match zero characters that doesn't match at the beginning of the string will match zero characters.
The Perl regex engine matches the leftmost longest match, with the leftmost part coming first. Not all regex engines work like that, though. If you want all of the technical details, read Mastering Regular Expressions, which explains how regex engines work and find matches.
To make your first RE match, use the anchor '$':
"Year 2010" =~ /([0-4]*)$/;
print $1;

What does this Perl regex mean: m/(.*?):(.*?)$/g?

I am editing a Perl file, but I don't understand this regexp comparison. Can someone please explain it to me?
if ($lines =~ m/(.*?):(.*?)$/g) { } ..
What happens here? $lines is a line from a text file.
Break it up into parts:
$lines =~ m/ (.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $1.
: # Match a colon.
(.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $2.
$ # Match the end of the line.
/gx;
So, this will match strings like ":" (it matches zero characters, then a colon, then zero characters before the end of the line, $1 and $2 are empty strings), or "abc:" ($1 = "abc", $2 is an empty string), or "abc:def:ghi" ($1 = "abc" and $2 = "def:ghi").
And if you pass in a line that doesn't match (it looks like this would be if the string does not contain a colon), then it won't process the code that's within the brackets. But if it does match, then the code within the brackets can use and process the special $1 and $2 variables (at least, until the next regular expression shows up, if there is one within the brackets).
There is a tool to help understand regexes: YAPE::Regex::Explain.
Ignoring the g modifier, which is not needed here:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(.*?):(.*?)$/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(.*?):(.*?)$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
See also perldoc perlre.
It was written by someone who either knows too much about regular expressions or not enough about the $' and $` variables.
THis could have been written as
if ($lines =~ /:/) {
... # use $` ($PREMATCH) instead of $1
... # use $' ($POSTMATCH) instead of $2
}
or
if ( ($var1,$var2) = split /:/, $lines, 2 and defined($var2) ) {
... # use $var1, $var2 instead of $1,$2
}
(.*?) captures any characters, but as few of them as possible.
So it looks for patterns like <something>:<somethingelse><end of line>, and if there are multiple : in the string, the first one will be used as the divider between <something> and <somethingelse>.
That line says to perform a regular expression match on $lines with the regex m/(.*?):(.*?)$/g. It will effectively return true if a match can be found in $lines and false if one cannot be found.
An explanation of the =~ operator:
Binary "=~" binds a scalar expression
to a pattern match. Certain operations
search or modify the string $_ by
default. This operator makes that kind
of operation work on some other
string. The right argument is a search
pattern, substitution, or
transliteration. The left argument is
what is supposed to be searched,
substituted, or transliterated instead
of the default $_. When used in scalar
context, the return value generally
indicates the success of the
operation.
The regex itself is:
m/ #Perform a "match" operation
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
: #Match a literal colon character
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
$ #Match the end of string
/g #Perform the regex globally (find all occurrences in $line)
So if $lines matches against that regex, it will go into the conditional portion, otherwise it will be false and will skip it.