What do these Perl regexes mean?

What does the following syntax mean in Perl?
$line =~ /([^:]+):/;
and
$line =~ s/([^:]+):/$replace/;

See perldoc perlreref
[^:]
is a character class that matches any character other than ':'.
[^:]+
means match one or more of such characters.
I am not sure the capturing parentheses are needed. In any case,
([^:]+):
captures a sequence of one or more non-colon characters followed by a colon.

$line =~ /([^:]+):/;
The =~ operator is called the binding operator; it runs a regex or substitution against a scalar value (in this case $line). As for the regex itself, () specifies a capture. Captures place the text that matches them in special global variables. These variables are numbered starting from one and correspond to the order in which the opening parentheses appear, so given
"abc" =~ /(.)(.)(.)/;
the $1 variable will contain "a", the $2 variable will contain "b", and the $3 variable will contain "c" (if you haven't guessed yet, . matches one character*). [] specifies a character class. A character class matches one character from the set it lists, so /[abc]/ will match one character if it is "a", "b", or "c". Character classes can be negated by starting them with ^. A negated character class matches one character that is not listed in it, so [^abc] will match one character that is not "a", "b", or "c" (for instance, "d" will match). The + is called a quantifier. Quantifiers tell you how many times the preceding pattern must match. + requires the pattern to match one or more times. (The * quantifier requires the pattern to match zero or more times.) The : has no special meaning to the regex engine, so it just means a literal :.
So, putting that information together we can see that the regex will match one or more non-colon characters (saving this part to $1) followed by a colon.
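A minimal sketch of the capture in action. The sample string "key: value" and the variable name $captured are assumptions chosen only for illustration:

```perl
use strict;
use warnings;

# Sample input; this string is an assumption made for illustration.
my $line = "key: value";

my $captured;
if ($line =~ /([^:]+):/) {
    # $1 holds the text matched inside the parentheses (the colon is not included).
    $captured = $1;
}

print defined $captured ? "captured: '$captured'\n" : "no match\n";
```

Copying $1 into a named variable right after a successful match is a common habit, since a later successful match overwrites it.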
$line =~ s/([^:]+):/$replace/;
This is a substitution. Substitutions have two parts, the regex, and the replacement string. The regex part follows all of the same rules as normal regexes. The replacement part is treated like a double quoted string. The substitution replaces whatever matches the regex with the replacement, so given the following code
my $line = "key: value";
my $replace = "option";
$line =~ s/([^:]+):/$replace/;
The $line variable will hold the string "option value".
You may find it useful to read perldoc perlretut.
* except newline, unless the /s option is used, in which case it matches any character

The first one captures the part in front of a colon from a line, such as "abc" in the string "abc:foo". More precisely it matches at least one non-colon character (though as many as possible) directly before a colon and puts them into a capture group.
The second one replaces said part, this time including the colon, with the contents of the variable $replace.

I may be misunderstanding some of the previous answers, but I think that there's a confusion about the second example. It will not replace only the captured item (i.e., one or more non-colons up until a colon) with $replace. It will replace all of ([^:]+): with $replace, the colon as well. (The substitution operates on the match, not just the capture.)
This means if you don't include a colon in $replace (and you want one), you will get bit:
my $line = 'http://www.example.com/';
my $replace = 'ftp';
$line =~ s/([^:]+):/$replace/;
print "Here's \$line now: $line\n";
Output:
Here's $line now: ftp//www.example.com/ # Damn, no colon!
I'm not sure if you are just looking at example code, but unless you plan to use the capture, I'm not sure you really want it in these examples.
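If the colon should survive the substitution, two possible approaches are sketched below. Neither comes from the original code; the variable names $with_colon and $lookahead are made up for this example:

```perl
use strict;
use warnings;

my $replace = 'ftp';

# Option 1: put the colon back in the replacement text.
my $with_colon = 'http://www.example.com/';
$with_colon =~ s/[^:]+:/${replace}:/;

# Option 2: use a lookahead so the colon is required but not consumed.
my $lookahead = 'http://www.example.com/';
$lookahead =~ s/[^:]+(?=:)/$replace/;

print "$with_colon\n";
print "$lookahead\n";
```

Neither version needs the capturing parentheses, because the replacement never refers to $1.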
If you are very unfamiliar with regular expressions (or Perl), you should look at perldoc perlrequick before trying perldoc perlre or perldoc perlretut.

The first one matches one or more characters that are anything but :, followed by a :. The second one matches the same thing but replaces the matched text with the contents of $replace.

perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new('([^:]+):')->explain"
The regular expression:
(?-imsx:([^:]+):)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^:]+                    any character except: ':' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

$line =~ /([^:]+):/;
Matches one or more characters other than : that come directly before a :
If $line = "http://www.google.com", it will match http (the variable $1 will contain http)
$line =~ s/([^:]+):/$replace/;
This time, replace the value matched by the content of the variable $replace

Related

Perl greedy regex is not acting greedy

Giving the following code:
use strict;
use warnings;
my $text = "asdf(blablabla)";
$text =~ s/(.*?)\((.*)\)/$2/;
print "\nfirst match: $1";
print "\nsecond match: $2";
I expected that $2 would catch my last bracket, yet my output is:
first match: asdf
second match: blablabla
If .* is greedy by default, why did it stop at the bracket?
The .* is a greedy subpattern, but it does not account for grouping. Grouping is defined with a pair of unescaped parentheses (see Use Parentheses for Grouping and Capturing).
See where your group boundaries are:
s/(.*?)\((.*)\)/$2/
  | G1|  |G2|
So, the \( and \) matching ( and ) are outside the groups, and will be part of neither $1 nor $2.
If you need the ) to be part of $2, use
s/(.*?)\((.*\))/$2/
            ^
A regex engine is processing both the string and the pattern from left to right. The first (.*?) is handled first, and it matches up to the first literal ( symbol as it is lazy (matches as few chars as possible before it can return a valid match), and the whole part before the ( is placed into Group 1 stack. Then, the ( is matched, but not captured, then (.*) matches any 0+ characters other than a newline up to the last ) symbol, and places the capture into Group 2. Then, the ) is just matched. The point is that .* grabs the whole string up to the end, but then backtracking happens since the engine tries to accommodate for the final ) in the pattern. The ) must be matched, but not captured in your pattern, thus, it is not part of Group 2 due to the group boundary placement. You can see the regex debugger at this regex demo page to see how the pattern matches your string.
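The effect of the group boundaries, and of greedy versus lazy matching, can be checked with a short script. This is only a sketch; the second input string "f(a)(b)" and all variable names are assumptions added for illustration:

```perl
use strict;
use warnings;

my $text = "asdf(blablabla)";

# Original pattern: the escaped parentheses sit outside both groups.
my ($before, $inside) = $text =~ /(.*?)\((.*)\)/;

# Variant: the escaped closing parenthesis is moved inside group 2.
my (undef, $inside_with_paren) = $text =~ /(.*?)\((.*\))/;

# Greediness only shows when there is more than one candidate ")".
my ($greedy) = "f(a)(b)" =~ /\((.*)\)/;
my ($lazy)   = "f(a)(b)" =~ /\((.*?)\)/;

print "group 1: $before\n";
print "group 2: $inside\n";
print "group 2 with paren: $inside_with_paren\n";
print "greedy: $greedy, lazy: $lazy\n";
```

With a single pair of brackets in the input, greedy and lazy quantifiers capture the same text, which is why the original output can look as if .* had stopped early.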

What is happening inside this regex alteration expression

The following regular expression works, but can anyone explain how?
Any comment is appreciated! Thanks! Quinoa
What is the regex "|" doing to strip the tags "<script>" and "</Script>" from <script>Keep THIS</Script> to get "Keep THIS" into memory $1?
Here is the REGEX:
(?x)
([\w\.!?,\s-])|<.*?>|.
Here is the string:
<script>Keep THIS</Script>
Results: $1 = "Keep THIS"
Commented below:
(?x)                     set flags for this block (disregarding
                         whitespace and comments) (case-sensitive)
                         (with ^ and $ matching normally) (with .
                         not matching \n)
(                        group and capture to \1:
  [\w\.!?,\s-]             any character of: word characters (a-z,
                           A-Z, 0-9, _), '\.', '!', '?', ',',
                           whitespace (\n, \r, \t, \f, and " "), '-'
)                        end of \1
|                        OR
<                        '<'
.*?                      any character except \n (0 or more times
                         (matching the least amount possible))
>                        '>'
|                        OR
.                        any character except \n
<.*?> matches all the tags, that is, every substring that starts with < and ends with >. From the remaining text, ([\w\.!?,\s-]) captures any word character, dot, !, ?, comma, whitespace character, or hyphen. Note that it captures each single character into group 1 separately.
If you want to capture the whole string Keep THIS into group 1 then you need to add + quantifier next to the character class. + repeats the previous token one or more times.
([\w\.!?,\s-]+)|<.*?>|.
Finally the . matches all the remaining characters which are not matched.
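A minimal sketch of the pattern with the + quantifier, run as a global match in a loop. The @kept array is an assumption made for this example, and the (?x) flag is left out because the pattern contains no whitespace to ignore:

```perl
use strict;
use warnings;

my $s = '<script>Keep THIS</Script>';

my @kept;
while ($s =~ /([\w\.!?,\s-]+)|<.*?>|./g) {
    # $1 is only defined when the first alternative matched.
    push @kept, $1 if defined $1;
}

print join('', @kept), "\n";
```

Checking defined $1 matters here because the tag and catch-all alternatives match without capturing anything.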
The only way this does what you say is if you are using a global match in a loop, and don't have use warnings in place as you should.
Here's what I think you have, but using Data::Dump to display the contents of $1 instead of what is presumably print $1 in your own code. (It really helps a lot to show your actual Perl code instead of selected snippets.)
use strict;
use warnings;
use Data::Dump;
my $s = '<script>Keep THIS</Script>';
my $re = qr/(?x)
([\w\.!?,\s-])|<.*?>|./;
while ( $s =~ /$re/g ) {
    dd $1;
}
output
undef
"K"
"e"
"e"
"p"
" "
"T"
"H"
"I"
"S"
undef
The first pass is matching <script>, which isn't captured so $1 is undefined.
Subsequent passes match a single character from the class [\w\.!?,\s-], which consumes the string Keep THIS one character at a time.
Finally, the closing </Script> is matched without capturing, and leaves $1 undefined again.
undef is printed as a null string, and without warnings enabled you won't be alerted to it.
The solution is to always use a proper HTML parser to process HTML. Regular expressions are the wrong tool for the job.

~m, s and () in perl regexp

I am trying to get hold of regular expressions in Perl. Can anyone please provide any examples of what matches and what doesn't for the below regular expression?
$sentence =~m/.+\/(.+)/s
=~ is the binding operator; it makes the regex match be performed on $sentence instead of the default $_. m is the match operator; it is optional (e.g. $foo =~ /bar/) when the regex is delimited by / characters but required if you want to use a different delimiter.
s is a regex flag that makes . in the regex match any characters; by default . does not match newlines.
The actual regex is .+\/(.+); this will match one or more characters, then a literal / character, then one or more other characters. Because the initial .+ consumes as much as possible while still allowing the regex to succeed, it will match up to the last / in the string that has at least one character after it; then the (.+) will capture the characters that follow that / and make them available as $1.
So it is essentially capturing the final component of a filepath. Of foo/bar it will capture the bar, of foo/bar/ it will capture the bar/. Strings with only one component, like /foo or bar/ or baz will not match.
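The cases above can be tried out with a small script. This is a sketch only; last_component is a hypothetical helper name, and 'a/b/c' is an extra input added for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured text, or undef when there is no match.
sub last_component {
    my ($sentence) = @_;
    return $sentence =~ m/.+\/(.+)/s ? $1 : undef;
}

for my $input ('foo/bar', 'foo/bar/', 'a/b/c', '/foo', 'bar/', 'baz') {
    my $got = last_component($input);
    printf "%-10s => %s\n", $input, defined $got ? "'$got'" : 'no match';
}
```

Returning undef for the non-matching case avoids reading a stale $1 left over from an earlier successful match.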
It matches any string, including multi-line strings, that contains a slash character somewhere in the middle, with at least one character on each side of it.
Matches:
foo/bar
asdf\nwrqwer/wrqwerqw # /s modifier allows '.' to match newlines
Doesn't match:
asdfasfdasf # no slash character
/asdfasdf # no characters before the slash
asdfasf/ # no characters after the slash
In addition, the entire substring that follows the last slash in the string will be captured and assigned to the variable $1.
Breakdown:
$sentence =~ — match $sentence with
m/ — the pattern consisting of
. — any character
+ — one or more times
\/ — then a forward-slash
( — and, saving in the $1 capture group,
.+ — any character one or more times
)
/s — allowing . to match newlines
See perldoc perlop for information about operators such as =~ and quote-like operators such as m//, and perldoc perlre about regular expressions and their options such as /s.

What does this snippet do in perl?

What does the following do in Perl?
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
I understand that [^a-zA-Z0-9]+ is a start of sentence and at least one of a-zA-Z0-9 and \s+ is at least one whitespace.
But I can not figure out what this snippet does as a whole.
First, it replaces any sequence of non-alphanumeric characters (being neither upper case chars, lower case chars nor numbers) in the string with a single space.
After that it replaces all multi-spaces, i.e. any sequence of whitespaces with just one space character.
The first pattern replaces every run of non-alphanumeric characters with a space.
The second replaces any run of whitespace characters (spaces, tabs, newlines) with a single space.
Note that you can replace these two patterns with a single pattern:
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
is more commonly written as
$string =~ s/[^a-zA-Z0-9]+/ /sg;
$string =~ s/\s+/ /sg;
The choice of delimiter isn't significant, but / is used by convention unless the pattern contains many / characters.
Here we have two instances of the substitution operator. Between the first two delimiters is a regular expression pattern to search for. Between the last two delimiters is the string with which to replace the matching text. The trailing s and g are flags.
The s flag affects what . matches. Given that . isn't used, the s flag is useless.
The g flag causes all matches to be replaced instead of just the first one.
The first regex pattern, [^a-zA-Z0-9]+
[...] is a character class that matches a single character among those specified. A leading ^ negates the class, so [^a-zA-Z0-9] matches any character other than unaccented latin letters and numbers.
atom+ matches atom one or more times, so [^a-zA-Z0-9]+ matches a sequence of non-alphanumeric characters (and some alphanumeric characters such as "é").
Therefore, s/[^a-zA-Z0-9]+/ /g replaces all sequences of non-alphanumeric characters (and some alphanumeric characters such as "é") with a single space. For example, "abc - déf :)" becomes "abc d f ".
The second regex pattern, \s+
\s matches any whitespace character (except the vertical tab and the non-breaking space sometimes).
Therefore, s/\s+/ /g replaces all sequences of white space with a single space. For example, "abc\tdef ghi\n" becomes "abc def ghi ".
As a whole
When used together, the second statement does absolutely nothing. There will never be any sequences of two or more whitespace characters left in $string after the first statement.
So
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
is the same as
$string =~ s/[^a-zA-Z0-9]+/ /g;
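The claimed equivalence is easy to spot-check. The sketch below assumes two hypothetical helper subs, clean_two_steps and clean_one_step, and a few made-up inputs:

```perl
use strict;
use warnings;

# Two statements, as in the question.
sub clean_two_steps {
    my ($string) = @_;
    $string =~ s#[^a-zA-Z0-9]+# #sg;
    $string =~ s#\s+# #sg;
    return $string;
}

# Single statement, as suggested above.
sub clean_one_step {
    my ($string) = @_;
    $string =~ s/[^a-zA-Z0-9]+/ /g;
    return $string;
}

for my $input ("abc - def :)", "a\t\tb \n c", "  leading and trailing  ") {
    my ($two, $one) = (clean_two_steps($input), clean_one_step($input));
    print "'$two' ", ($two eq $one ? "==" : "!="), " '$one'\n";
}
```

A handful of inputs is not a proof, but it does show the second substitution leaving the string unchanged in each case.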

What does this Perl regex mean: m/(.*?):(.*?)$/g?

I am editing a Perl file, but I don't understand this regexp comparison. Can someone please explain it to me?
if ($lines =~ m/(.*?):(.*?)$/g) { } ..
What happens here? $lines is a line from a text file.
Break it up into parts:
$lines =~ m/ (.*?)   # Match any character (except newlines)
                     # zero or more times, not greedily, and
                     # stick the results in $1.
             :       # Match a colon.
             (.*?)   # Match any character (except newlines)
                     # zero or more times, not greedily, and
                     # stick the results in $2.
             $       # Match the end of the line.
           /gx;
So, this will match strings like ":" (it matches zero characters, then a colon, then zero characters before the end of the line, $1 and $2 are empty strings), or "abc:" ($1 = "abc", $2 is an empty string), or "abc:def:ghi" ($1 = "abc" and $2 = "def:ghi").
And if you pass in a line that doesn't match (it looks like this would be if the string does not contain a colon), then it won't process the code that's within the brackets. But if it does match, then the code within the brackets can use and process the special $1 and $2 variables (at least, until the next regular expression shows up, if there is one within the brackets).
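A short script to try the sample strings mentioned above. It is a sketch that drops the /g flag and wraps the match in a hypothetical helper, colon_parts, so the captures can be returned as a list:

```perl
use strict;
use warnings;

# Hypothetical helper: returns ($1, $2) on a match, or an empty list otherwise.
sub colon_parts {
    my ($lines) = @_;
    return $lines =~ m/(.*?):(.*?)$/ ? ($1, $2) : ();
}

for my $lines (':', 'abc:', 'abc:def:ghi', 'no colon here') {
    my @parts = colon_parts($lines);
    if (@parts) {
        print "'$lines' matched: \$1='$parts[0]' \$2='$parts[1]'\n";
    }
    else {
        print "'$lines' did not match\n";
    }
}
```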
There is a tool to help understand regexes: YAPE::Regex::Explain.
Ignoring the g modifier, which is not needed here:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(.*?):(.*?)$/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(.*?):(.*?)$)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
See also perldoc perlre.
It was written by someone who either knows too much about regular expressions or not enough about the $' and $` variables.
This could have been written as
if ($lines =~ /:/) {
... # use $` ($PREMATCH) instead of $1
... # use $' ($POSTMATCH) instead of $2
}
or
if ( ($var1,$var2) = split /:/, $lines, 2 and defined($var2) ) {
... # use $var1, $var2 instead of $1,$2
}
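A small sketch of the split variant, with the variables declared so it runs under use strict. The helper name split_first_colon and the sample strings are assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: splits on the first colon only.
sub split_first_colon {
    my ($lines) = @_;
    # The limit of 2 stops split after the first colon.
    my ($var1, $var2) = split /:/, $lines, 2;
    return ($var1, $var2);
}

for my $lines ('abc:def:ghi', 'key:value', 'no colon here') {
    my ($var1, $var2) = split_first_colon($lines);
    if (defined $var2) {
        print "'$lines' => '$var1' and '$var2'\n";
    }
    else {
        print "'$lines' has no colon\n";
    }
}
```

Unlike the regex, split does not care about newlines in the input, so the two approaches are only interchangeable for single-line strings.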
(.*?) captures any characters, but as few of them as possible.
So it looks for patterns like <something>:<somethingelse><end of line>, and if there are multiple : in the string, the first one will be used as the divider between <something> and <somethingelse>.
That line says to perform a regular expression match on $lines with the regex m/(.*?):(.*?)$/g. It will effectively return true if a match can be found in $lines and false if one cannot be found.
An explanation of the =~ operator:
Binary "=~" binds a scalar expression to a pattern match. Certain operations search or modify the string $_ by default. This operator makes that kind of operation work on some other string. The right argument is a search pattern, substitution, or transliteration. The left argument is what is supposed to be searched, substituted, or transliterated instead of the default $_. When used in scalar context, the return value generally indicates the success of the operation.
The regex itself is:
m/      # Perform a "match" operation
(.*?)   # Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
:       # Match a literal colon character
(.*?)   # Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
$       # Match the end of string
/g      # Match globally; in scalar context (as in this if), each evaluation finds only the next match in $lines
So if $lines matches against that regex, it will go into the conditional portion, otherwise it will be false and will skip it.
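One side effect of /g in a condition is worth seeing once: in scalar context the match resumes from pos($lines), so testing the same unchanged string twice can give different answers. A minimal sketch, with the sample string "a:b" and the @results array assumed for illustration:

```perl
use strict;
use warnings;

my $lines = "a:b";
my @results;

for my $attempt (1 .. 2) {
    # With /g in scalar context, each match resumes from pos($lines).
    if ($lines =~ m/(.*?):(.*?)$/g) {
        push @results, "attempt $attempt: matched";
    }
    else {
        push @results, "attempt $attempt: no match";
    }
}

print "$_\n" for @results;
```

In the original code $lines is presumably assigned a fresh line on every iteration, which resets the position, so this is unlikely to bite there; dropping the /g avoids the question entirely.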