The following regular expression works, but can anyone explain how?
Any comment is appreciated! Thanks! Quinoa
What is the regex "|" doing to strip the tags "<script>" and "</Script>" from <script>Keep THIS</Script> and get "Keep THIS" into $1?
Here is the REGEX:
(?x)
([\w\.!?,\s-])|<.*?>|.
Here is the string:
<script>Keep THIS</Script>
Results: $1 = "Keep THIS"
Commented below:
(?x)                     set flags for this block (disregarding
                         whitespace and comments) (case-sensitive)
                         (with ^ and $ matching normally) (with .
                         not matching \n)
(                        group and capture to \1:
  [\w\.!?,\s-]             any character of: word characters (a-z,
                           A-Z, 0-9, _), '\.', '!', '?', ',',
                           whitespace (\n, \r, \t, \f, and " "), '-'
)                        end of \1
|                        OR
<                        '<'
.*?                      any character except \n (0 or more times
                         (matching the least amount possible))
>                        '>'
|                        OR
.                        any character except \n
<.*?> matches each tag, that is, any string that starts with < and ends with >; the ? makes it stop at the first >. Because < is not in the character class, the first alternative fails at a tag and the second one takes over. On the remaining text, ([\w\.!?,\s-]) captures a word character, dot, !, ?, comma, whitespace character, or hyphen. Note that it captures only a single character into group 1 per match.
If you want to capture the whole string Keep THIS into group 1 then you need to add + quantifier next to the character class. + repeats the previous token one or more times.
([\w\.!?,\s-]+)|<.*?>|.
Finally, the . matches any remaining character that neither of the first two alternatives matched.
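As a rough sketch of the + variant in use, the loop below walks the string with a global match and keeps only the passes where group 1 took part. The helper name extract_text is made up for this illustration and is not from the question.

```perl
use strict;
use warnings;

# Collect every defined capture produced by a global match.
# extract_text is a hypothetical helper, not part of the original question.
sub extract_text {
    my ($s) = @_;
    my @kept;
    while ( $s =~ /([\w.!?,\s-]+)|<.*?>|./g ) {
        push @kept, $1 if defined $1;   # tags and stray characters leave $1 undefined
    }
    return @kept;
}

my @parts = extract_text('<script>Keep THIS</Script>');
print "[$_]\n" for @parts;   # prints [Keep THIS]
```

Checking defined $1 matters because two of the three alternatives match without capturing anything.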
The only way this does what you say is if you are using a global match in a loop, and don't have use warnings in place as you should.
Here's what I think you have, but using Data::Dump to display the contents of $1 instead of what is presumably print $1 in your own code. (It really helps a lot to show your actual Perl code instead of selected snippets.)
use strict;
use warnings;
use Data::Dump;
my $s = '<script>Keep THIS</Script>';
my $re = qr/(?x)
([\w\.!?,\s-])|<.*?>|./;
while ( $s =~ /$re/g ) {
dd $1;
}
output
undef
"K"
"e"
"e"
"p"
" "
"T"
"H"
"I"
"S"
undef
The first pass is matching <script>, which isn't captured so $1 is undefined.
Subsequent passes match a single character from the class [\w\.!?,\s-], which consumes the string Keep THIS one character at a time.
Finally, the closing </Script> is matched without capturing, and leaves $1 undefined again.
undef is printed as a null string, and without warnings enabled you won't be alerted to it.
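To see how the single-character captures add up to the result in the question, here is a small sketch that appends each defined $1 to a buffer. The variable $text is an assumption added for the illustration.

```perl
use strict;
use warnings;

my $s  = '<script>Keep THIS</Script>';
my $re = qr/([\w.!?,\s-])|<.*?>|./;

my $text = '';
while ( $s =~ /$re/g ) {
    $text .= $1 if defined $1;   # skip the passes where a tag matched
}
print "$text\n";   # prints Keep THIS
```

Printing $1 on every pass without the defined check gives the same visible text, but only because undef prints as an empty string, and it triggers a warning under use warnings.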
The solution is to always use a proper HTML parser to process HTML. Regular expressions are the wrong tool for the job.
I have the following file(like this scheme, but much longer):
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Every line has 2 words (e.g. LSE ZTX), with optional spaces and/or tabs at the beginning and at the end, and always some between the words.
Could someone help me to match these 2 words with regexp? Following the example I wish to have LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second etc.
I have tried something like:
$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;
I don't know how to say that there could be either spaces or tabs (or both of them mixed, e.g. \t\s\t).
If you want to just match the two first words, the most basic thing is to just match any sequence of characters that are not whitespace:
my ($word1, $word2) = $line =~ /\S+/g;
This will capture the first two words in $line into the variables, if they exist. Note that parentheses are not required when using the /g modifier. Use an array instead if you want to capture all existing matches.
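A minimal sketch of that list-context behaviour, using a sample line with mixed tabs and spaces (the sample string is an assumption based on the file format in the question):

```perl
use strict;
use warnings;

my $line = "  \tLSE \t ZTX  \n";

# In list context, /g returns every match; no capture groups needed.
my @words = $line =~ /\S+/g;
my ( $word1, $word2 ) = @words;

print scalar(@words), " words: $word1, $word2\n";   # prints 2 words: LSE, ZTX
```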
Since there are always two words, you don't need to match the entire line, so the simplest regex would be:
/(\w+)\s+(\w+)/
I think this is what you want
^\s*([A-Z]+)\s+([A-Z]+)
You find the first code of a row in group 1 and the second in group 2. \s is a whitespace character; it includes spaces, tabs and newline characters.
In Perl it is something like this:
($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;
I think you are reading the text file row by row, so you don't need the modifiers s and m, and g is also not needed.
In case the codes are not only ASCII letters, then replace [A-Z] with \p{L}. \p{L} is a Unicode property that will match every letter in every language.
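As an illustration of the \p{L} variant, the sketch below uses a made-up line whose codes are Cyrillic letters, written as escapes so the source file stays ASCII. It prints only lengths to avoid wide-character output warnings.

```perl
use strict;
use warnings;

# Hypothetical line whose codes are Cyrillic letters, written as escapes.
my $line = "  \x{416}\x{418} \t \x{42F}\x{411}\x{412}  ";

my ( $code1, $code2 ) = $line =~ /^\s*(\p{L}+)\s+(\p{L}+)/;

printf "%d and %d letters\n", length($code1), length($code2);   # prints 2 and 3 letters
```

The [A-Z] version would not match this line at all, even with /i, since none of the letters are ASCII.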
\s also matches tabs, so your regex looks like:
$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;
the first word is in the first group ($1) and the second in $2.
You can change [A-Z] to whatever's more convenient with your needs.
Here is the explanation from YAPE::Regex::Explain
The regular expression:
(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
With option "Multiline" this Regex:
^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$
Will give you N matches each containing 2 groups named:
- word1
- word2
^\s*([A-Z]{3,4})\s+([A-Z]{3,4})$
What this does
^ // Matches the beginning of a string
\s* // Matches a space/tab character zero or more times
([A-Z]{3,4}) // Matches any letter A-Z either 3 or 4 times and captures to $1
\s+ // Then matches at least one tab or space
([A-Z]{3,4}) // Matches any letter A-Z either 3 or 4 times and captures to $2
$ // Matches the end of a string
You can use split here:
use strict;
use warnings;
while (<DATA>) {
my ( $word1, $word2 ) = split;
print "($word1, $word2)\n";
}
__DATA__
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Output:
(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)
Assuming the spaces at the start of the line are what you use to identify your codes you want, try this:
Split your string up at newlines, then try this regex:
^\s+(\w+\s+){2}$
This will only match lines that start with some space, followed by a (word - some space - word), then end with some space.
# ^ --> String start
# \s+ --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by some space)x2
# $ --> String end.
However, if you want to capture the codes alone, try this:
$line =~ /^\s*(\w+)\s+(\w+)/;
# \s* --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+ --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),
This will match all your codes
/[A-Z]+/
I am trying to suppress strings that begin with [T without doing a positive match and negating the results.
my @tests = ("OT", "[T","NOT EXCLUDED");
foreach my $test (@tests)
{
#match from start of string,
#include 'Not left sq bracket' then include 'Not capital T'
if ($test =~ /^[^\[][^T]/) #equivalent to /^[^\x5B][^T]/
{
print $test,"\n";
}
}
Outputs
NOT EXCLUDED
My question is, can somebody tell me why OT is being excluded in the above example?
EDIT
Thanks for your replies so far everybody, I can see I was being a bit stoopid.
The regex ^[^\[][^T] matches strings that begin with a character other than [ followed by a character other than T.
Since OT has T as 2nd character, it is not matched.
If you want to match any string other than those that begin with [T, you can do:
if ($test =~ /^(?!\[T)/) {
print $test,"\n";
}
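Here is a short sketch of the negative lookahead applied to the test strings from the question, plus one extra string ("[X") added as an assumption to show that a leading bracket alone is not excluded:

```perl
use strict;
use warnings;

my @tests = ( "OT", "[T", "NOT EXCLUDED", "[X" );

# Keep every string that does not start with the two characters "[T".
my @kept = grep { /^(?!\[T)/ } @tests;

print "$_\n" for @kept;   # prints OT, NOT EXCLUDED and [X, one per line
```

The lookahead consumes nothing; it only checks that "[T" does not follow the start of the string.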
YAPE::Regex::Explain can be helpful:
$ perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(qr/^[^\[][^T]/)->explain'
The regular expression:
(?-imsx:^[^\[][^T])
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
[^\[] any character except: '\['
----------------------------------------------------------------------
[^T] any character except: 'T'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
Your regex translates to:
From start of the input, match anything but an open square bracket ([) followed by anything but a capital T
OT fails to match
[T as well
Your expression is equivalent to "begins with NOT [ and second one is NOT T", so the only one that passes is NOT EXCLUDED, because in OT, the second letter is T
If I run
"Year 2010" =~ /([0-4]*)/;
print $1;
I get empty string.
But
"Year 2010" =~ /([0-4]+)/;
print $1;
outputs "2010". Why?
You get an empty match right at the start of the string "Year 2010" for the first form because the * will immediately match 0 digits. The + form will have to wait until it sees at least one digit before it matches.
Presumably if you can go through all the matches of the first form, you'll eventually find 2010... but probably only after it finds another empty match before the 'e', then before the 'a' etc.
The first regular expression successfully matches zero digits at the start of the string, which results in capturing the empty string.
The second regular expression fails to match at the start of the string, but it does match when it reaches 2010.
The first matches the zero-length string at the beginning (before Y) and returns it. The second searches for one-or-more digits and waits until it finds 2010.
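To make the difference visible, this sketch runs both patterns with /g in list context and prints each capture in quotes. The array names are assumptions for the illustration.

```perl
use strict;
use warnings;

my $str = "Year 2010";

# List-context /g returns the capture from every successive match.
my @star = $str =~ /([0-4]*)/g;
my @plus = $str =~ /([0-4]+)/g;

print "star: ", join( ",", map { "'$_'" } @star ), "\n";
print "plus: ", join( ",", map { "'$_'" } @plus ), "\n";   # prints plus: '2010'
```

The star list starts with empty strings, one for each position where no digit could be matched, and 2010 only shows up once the engine reaches the digits.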
you can also use YAPE::Regex::Explain for explanation of a regular expression like
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new('([0-4]*)')->explain();
print YAPE::Regex::Explain->new('([0-4]+)')->explain();
output:
The regular expression:
(?-imsx:([0-4]*))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-4]* any character of: '0' to '4' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The regular expression:
(?-imsx:([0-4]+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-4]+ any character of: '0' to '4' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The star symbol tries to basically match 0 or more symbols in given set (in theory, the set {x,y}* consists of empty string and all possible finite sequences made of x and y), and therefore, it will match exactly zero characters (empty string) at the beginning of the string, zero characters after first character, zero characters after the second character, etc. Then finally it will find 2 and match whole 2010.
The plus symbol matches one or more characters from the given set ({x,y}+ consists of all possible finite sequences made of x and y, without the empty string, as opposed to {x,y}*). So the first met matching character is 2, then next - 0 is checked, then 1, then another 0, and then the sentence ends, so found group looks like '2010'.
It is standard behavior for regular expressions, defined in formal language theory. I strongly suggest to learn a bit theory about regular expressions, it can't hurt, but can help :)
We have this as a trick question in Learning Perl. Any regex that can match zero characters will match at the beginning of the string, matching zero characters there if nothing longer fits.
The Perl regex engine returns the leftmost match: it takes the first position where the pattern can succeed, and greedy quantifiers then grab as much as they can from that position. Not all regex engines work like that, though. If you want all of the technical details, read Mastering Regular Expressions, which explains how regex engines work and find matches.
To make your first RE match, use the anchor '$':
"Year 2010" =~ /([0-4]*)$/;
print $1;
I am editing a Perl file, but I don't understand this regexp comparison. Can someone please explain it to me?
if ($lines =~ m/(.*?):(.*?)$/g) { } ..
What happens here? $lines is a line from a text file.
Break it up into parts:
$lines =~ m/ (.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $1.
: # Match a colon.
(.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $2.
$ # Match the end of the line.
/gx;
So, this will match strings like ":" (it matches zero characters, then a colon, then zero characters before the end of the line, $1 and $2 are empty strings), or "abc:" ($1 = "abc", $2 is an empty string), or "abc:def:ghi" ($1 = "abc" and $2 = "def:ghi").
And if you pass in a line that doesn't match (it looks like this would be if the string does not contain a colon), then it won't process the code that's within the brackets. But if it does match, then the code within the brackets can use and process the special $1 and $2 variables (at least, until the next regular expression shows up, if there is one within the brackets).
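The cases described above can be checked with a small sketch. The helper name split_at_colon is hypothetical, and the /g modifier is left out because a single match is all that is needed here.

```perl
use strict;
use warnings;

# Return ($1, $2) on a match, or an empty list otherwise.
# split_at_colon is a hypothetical helper for this illustration.
sub split_at_colon {
    my ($lines) = @_;
    return $lines =~ m/(.*?):(.*?)$/ ? ( $1, $2 ) : ();
}

for my $lines ( ":", "abc:", "abc:def:ghi", "no colon here" ) {
    my @parts = split_at_colon($lines);
    if (@parts) {
        print "'$lines' -> \$1='$parts[0]' \$2='$parts[1]'\n";
    }
    else {
        print "'$lines' -> no match\n";
    }
}
```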
There is a tool to help understand regexes: YAPE::Regex::Explain.
Ignoring the g modifier, which is not needed here:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(.*?):(.*?)$/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(.*?):(.*?)$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
See also perldoc perlre.
It was written by someone who either knows too much about regular expressions or not enough about the $' and $` variables.
This could have been written as
if ($lines =~ /:/) {
... # use $` ($PREMATCH) instead of $1
... # use $' ($POSTMATCH) instead of $2
}
or
if ( ($var1,$var2) = split /:/, $lines, 2 and defined($var2) ) {
... # use $var1, $var2 instead of $1,$2
}
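A quick sketch of the split alternative, assuming two sample strings that are not from the original post:

```perl
use strict;
use warnings;

for my $lines ( "abc:def:ghi", "no colon here" ) {
    # A limit of 2 stops splitting after the first colon.
    my ( $var1, $var2 ) = split /:/, $lines, 2;
    if ( defined $var2 ) {
        print "'$var1' and '$var2'\n";   # first pass prints 'abc' and 'def:ghi'
    }
    else {
        print "no colon in '$lines'\n";
    }
}
```

The defined check plays the role of the regex failing to match: without a colon, split returns a single field and $var2 stays undefined.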
(.*?) captures any characters, but as few of them as possible.
So it looks for patterns like <something>:<somethingelse><end of line>, and if there are multiple : in the string, the first one will be used as the divider between <something> and <somethingelse>.
That line says to perform a regular expression match on $lines with the regex m/(.*?):(.*?)$/g. It will effectively return true if a match can be found in $lines and false if one cannot be found.
An explanation of the =~ operator:
Binary "=~" binds a scalar expression
to a pattern match. Certain operations
search or modify the string $_ by
default. This operator makes that kind
of operation work on some other
string. The right argument is a search
pattern, substitution, or
transliteration. The left argument is
what is supposed to be searched,
substituted, or transliterated instead
of the default $_. When used in scalar
context, the return value generally
indicates the success of the
operation.
The regex itself is:
m/ #Perform a "match" operation
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
: #Match a literal colon character
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
$ #Match the end of string
/g #Perform the regex globally (find all occurrences in $lines)
So if $lines matches against that regex, it will go into the conditional portion, otherwise it will be false and will skip it.
What does the following syntax mean in Perl?
$line =~ /([^:]+):/;
and
$line =~ s/([^:]+):/$replace/;
See perldoc perlreref
[^:]
is a character class that matches any character other than ':'.
[^:]+
means match one or more of such characters.
I am not sure the capturing parentheses are needed. In any case,
([^:]+):
captures a sequence of one or more non-colon characters followed by a colon.
$line =~ /([^:]+):/;
The =~ operator is called the binding operator, it runs a regex or substitution against a scalar value (in this case $line). As for the regex itself, () specify a capture. Captures place the text that matches them in special global variables. These variables are numbered starting from one and correspond to the order the parentheses show up in, so given
"abc" =~ /(.)(.)(.)/;
the $1 variable will contain "a", the $2 variable will contain "b", and the $3 variable will contain "c" (if you haven't guessed yet . matches one character*). [] specifies a character class. Character classes will match one character in them, so /[abc]/ will match one character if it is "a", "b", or "c". Character classes can be negated by starting them with ^. A negated character class matches one character that is not listed in it, so [^abc] will match one character that is not "a", "b", or "c" (for instance, "d" will match). The + is called a quantifier. Quantifiers tell you how many times the preceding pattern must match. + requires the pattern to match one or more times. (the * quantifier requires the pattern to match zero or more times). The : has no special meaning to the regex engine, so it just means a literal :.
So, putting that information together we can see that the regex will match one or more non-colon characters (saving this part to $1) followed by a colon.
$line =~ s/([^:]+):/$replace/;
This is a substitution. Substitutions have two parts, the regex, and the replacement string. The regex part follows all of the same rules as normal regexes. The replacement part is treated like a double quoted string. The substitution replaces whatever matches the regex with the replacement, so given the following code
my $line = "key: value";
my $replace = "option";
$line =~ s/([^:]+):/$replace/;
The $line variable will hold the string "option value".
You may find it useful to read perldoc perlretut.
* except newline, unless the /s option is used, in which case it matches any character
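A small sketch of how the dot treats a newline, with and without the /s modifier (the test string is an assumption for the illustration):

```perl
use strict;
use warnings;

my $text = "a\nb";

my $plain  = $text =~ /a.b/  ? 1 : 0;   # . refuses to cross the newline
my $dotall = $text =~ /a.b/s ? 1 : 0;   # /s lets . match the newline too

print "without /s: $plain, with /s: $dotall\n";   # prints without /s: 0, with /s: 1
```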
The first one captures the part in front of a colon from a line, such as "abc" in the string "abc:foo". More precisely it matches at least one non-colon character (though as many as possible) directly before a colon and puts them into a capture group.
The second one substitutes said part, this time including the colon, with the contents of the variable $replace.
I may be misunderstanding some of the previous answers, but I think that there's a confusion about the second example. It will not replace only the captured item (i.e., one or more non-colons up until a colon) by $replace. It will replace all of ([^:]+): with $replace - the colon as well. (The substitution operates on the match, not just the capture.)
This means if you don't include a colon in $replace (and you want one), you will get bit:
my $line = 'http://www.example.com/';
my $replace = 'ftp';
$line =~ s/([^:]+):/$replace/;
print "Here's \$line now: $line\n";
Output:
Here's $line now: ftp//www.example.com/ # Damn, no colon!
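One way to keep the colon, sketched under the assumption that you want the scheme swapped and nothing else changed, is to put it back in the replacement. The variable $fixed is introduced only for this illustration.

```perl
use strict;
use warnings;

my $line    = 'http://www.example.com/';
my $replace = 'ftp';

# The whole match (scheme plus colon) is replaced, so put the colon back.
( my $fixed = $line ) =~ s/([^:]+):/${replace}:/;

print "$fixed\n";   # prints ftp://www.example.com/
```

Writing ${replace} with braces keeps the variable name clearly separated from the colon that follows it.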
I'm not sure if you are just looking at example code, but unless you plan to use the capture, I'm not sure you really want it in these examples.
If you are very unfamiliar with regular expressions (or Perl), you should look at perldoc perlrequick before trying perldoc perlre or perldoc perlretut.
The first matches one or more characters that are anything but :, followed by a :. The second matches the same thing but replaces it with $replace.
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new('([^:]+):')->explain"
The regular expression:
(?-imsx:([^:]+):)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
$line =~ /([^:]+):/;
Matches one or more characters that are not :, followed by a :
If $line = "http://www.google.com", it will match http (the variable $1 will contain http)
$line =~ s/([^:]+):/$replace/;
This time, replace the value matched by the content of the variable $replace