managing and documenting a multiline substitution in Perl - regex

I have recently been learning about the /x modifier in Perl Best Practices, which lets you do cool things like spread a regex over several indented lines and document it:
$txt =~ m/^ # anchor at beginning of line
The\ quick\ (\w+)\ fox # fox adjective
\ (\w+)\ over # fox action verb
\ the\ (\w+)\ dog # dog adjective
(?: # whitespace-trimmed comment:
\s* \# \s* # whitespace and comment token
(.*?) # captured comment text; non-greedy!
\s* # any trailing whitespace
)? # this is all optional
$ # end of line anchor
/x; # allow whitespace
However, I was unable to do the equivalent for find/replace substitutions. Is there a similar best practice for managing complex substitutions more effectively?
Edit: Take this as an example:
$test =~ s/(src\s*=\s*['"]?)(.*?\.(jpg|gif|png))/${1}something$2/sig;
Is there a similar way that this could be documented using multi-line/whitespace for better readability?
Many thanks

Since you've chosen not to provide an example of something that doesn't work, I'll offer a few guesses at what you might be doing wrong:
Note that the delimiter (in your case /) cannot appear inside any comments inside the regex, because then they'll be indicating the end of the regex. For example, this:
s/foo # this is interesting and/or cool
/bar/x
will not work, because the regex is terminated by the slash between and and or.
Note that /x does not work on the replacement-string, only on the regex itself. For example this:
s/foo/bar # I love the word bar/x
will replace foo with bar # I love the word bar.
If you really want to be able to put comments in the replacement-string, then I suppose you could use a replacement-expression instead, using the /e flag. That would let you use the full syntax of Perl. For example:
s/foo/'bar' # I love the word bar/e
Here is an example that does work:
$test =~
s/
# the regex to replace:
(src\s*=\s*['"]?) # src=' or src=" (plus optional whitespace)
(.*?\.(jpg|gif|png)) # the URI of the JPEG or GIF or PNG image
/
# the string to replace it with:
$1 . # src=' or src=" (unchanged)
'something' . # insert 'something' at the start of the URI
$2 # the original URI
/sige;
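The two pitfalls described above can be reproduced with a short script. This is only a sketch: the sample strings and the variable names $first and $second are made up for illustration. The first substitution uses brace delimiters, one way to keep a slash in an /x comment from ending the pattern. The second shows that /x leaves the replacement string untouched.

```perl
use strict;
use warnings;

# 1. With brace delimiters, a slash inside an /x comment no longer ends the pattern.
my $first = 'foo fighters';
$first =~ s{
    foo   # this is interesting and/or cool
}{bar}x;
print "$first\n";

# 2. /x only affects the pattern, so this "comment" ends up in the result.
my $second = 'foo';
$second =~ s/foo/bar # I love the word bar/x;
print "$second\n";
```

With brace delimiters the braces themselves have to stay balanced, comments included, so the choice of delimiter depends on what the pattern and its comments contain.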

If we just add the /x, we can break up the regular expression portion easily, including allowing comments.
my $test = '<img src = "http://www.somewhere.com/im/alright/jack/keep/your/hands/off/of/my/stack.gif" />';
$test =~ s/
( src \s* = \s* ['"]? ) # a src attribute ...
( .*?
\. (jpg|gif|png) # to an image file type, either jpeg, gif or png
)
/$1something$2/sigx # put 'something' in front of it
;
You have to use the evaluation switch (/e) if you want to break up the replacement. But splitting the match portion over multiple lines works fine.
Notice that I did not have to separate $1, because $1something is not a valid identifier anyway, so my version of Perl, at least, does not get confused.
For most of my evaluated replacements, I prefer the bracket style of substitution delimiter:
$test =~ s{
( src \s* = \s* ['"]? ) # a src attribute ... '
( .*?
\. (jpg|gif|png) # to an image file type, either jpeg, gif or png
)
}{
$1 . 'something' . $2
}sigxe
;
just to make it look more code-like.

Well
$test =~ s/(src\s*=\s*['"]?) # first group
(.*?\.(jpg|gif|png)) # second group
/${1}something$2/sigx;
does indeed work. Of course, you can't do the same in the replacement part unless you add /e and use something like:
$test =~ s/(src\s*=\s*['"]?) # first group
(.*?\.(jpg|gif|png)) # second group
/
$1 # Get 1st group
. "something" # Append ...
. $2 # Get 2nd group
/sigxe;

s/foo/bar/
could be written as
s/
foo # foo
/
"bar" # bar
/xe
/x to allow whitespace in the pattern
/e to allow code in the replacement expression
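One further option, which none of the answers above mention and which is offered here only as a sketch, is to document the pattern once with qr//x and then use it in a plain substitution. The variable name $image_src and the sample input are assumptions made up for the example, and the inner extension group has been made non-capturing. Note that comments inside a pattern are still subject to variable interpolation, so they should avoid $ and @.

```perl
use strict;
use warnings;

# The pattern is documented once; its flags (x, s, i) travel with the qr// object.
my $image_src = qr{
    ( src \s* = \s* ['"]? )       # group 1: attribute name, equals sign, optional quote
    ( .*? \. (?:jpg|gif|png) )    # group 2: shortest run ending in an image extension
}xsi;

my $test = '<img src="pics/cat.gif" />';

# The replacement stays an ordinary interpolated string, so no /e is needed.
$test =~ s/$image_src/${1}something$2/g;
print "$test\n";
```

This keeps the substitution itself on one line while the commentary lives with the pattern definition.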

Related

Issue with matching

The following code is trying to match a format like
[a=>]b[->c][d:e]
where a=>, ->c, d:e are optional.
($reg =~ /^
(?:([\w\/]+)=>)? # (optional)
(\w+) # (required)
(?:->(\w+))? # (optional)
(\[\d+\]|\[\d+:\d+\])? # (optional)
.$/x)
or croak("-E Invalid register format");
When I give the input as sample=>STATUS as $reg value, the last S of STATUS is getting truncated. Why?
The . just before your $ end-of-line anchor must match exactly one character, which in your case is the final S.
So your regex is almost right, but because that . has to be satisfied, the matcher backtracks the required (\w+) group by one character to give the . the character it demands.
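As a minimal sketch of that diagnosis, here is the same pattern with the stray . removed, run against the input from the question. The variable names $prefix and $name are made up for illustration.

```perl
use strict;
use warnings;

my $reg = 'sample=>STATUS';

my ($prefix, $name);
if ($reg =~ /^
        (?: ([\w\/]+) => )?          # optional "a=>" prefix
        (\w+)                        # required name
        (?: -> (\w+) )?              # optional "->c" part
        ( \[\d+\] | \[\d+:\d+\] )?   # optional "[d]" or "[d:e]" part
    $/x) {
    ($prefix, $name) = ($1, $2);
    print "prefix=$prefix name=$name\n";
}
```

Without the trailing ., nothing forces (\w+) to give up its last character, so the full name is captured.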
You need to add, and escape, the square brackets:
my $regex=qr{^
(?:(\[[\w\/]+)=>\])? # (optional)
(\w+) # (required)
(?:\[->(\w+)\])? # (optional)
(?:\[\w+\]|\[\w+:\w+\])? # (optional)
}x;

Why isnt greedy matching working in perl regex group

I am trying to grab only what's BETWEEN the body tags in HTML with a Perl regex (so I don't want to include the actual body tags, hence the groups that keep the tags out of the result).
Here are some short test subjects:
<body>test1</body>
<body style="bob">test2</body>
So first, simple version I tried was:
(?<=<body>).*(?=</body>)
which returns test1 and an empty string
So then I tried:
(?<=<body).*(?=</body>)
Which now gives a result for both tests, but of course has garbage: ">test1" and " style="bob">test2"
I've tried every variation of greedy match now in the first version, e.g.: (?<=<body.*>).*(?=</body>)
But it simply will not work! Any time I put the * in the lookbehind I get errors. Is anybody able to help out?
I am trying to grab only whats BETWEEN the body tags
In that case:
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <DATA>) {
if ($line =~ m{ <body [^>]*> (.+) </body> }xs) {
print "[$1]\n";
}
}
__DATA__
<body>test1</body>
<body style="bob">test2</body>
<!-- <body class="one"> --><body>This is why you should use an HTML parser</body>
Output:
[test1]
[test2]
[ --><body>This is why you should use an HTML parser]
You're looking for
use feature 'say';
# m{...} delimiters, so the slash in </body does not end the pattern
while ($html =~ m{ <body[^>]*> ( (?: (?! </body\b ) . )* ) }sxg) {
say $1;
}
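A self-contained sketch of this approach, run against the two test subjects from the question, might look like the following. The $html string and the @bodies array are made up for the example, and the pattern uses m{...} delimiters so that the slash in </body is not taken as the end of the regex.

```perl
use strict;
use warnings;

my $html = qq{<body>test1</body>\n<body style="bob">test2</body>\n};

my @bodies;
# After the opening tag, consume one character at a time as long as
# the closing tag does not start at the current position.
while ($html =~ m{ <body[^>]*> ( (?: (?! </body\b ) . )* ) }sxg) {
    push @bodies, $1;
}
print "[$_]\n" for @bodies;
```

Like any regex-based approach it can still be fooled by comments or script content, which is why an HTML parser is the more robust choice.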
I don't think using $& is efficient. Personally, I'd use capture groups,
but this works pretty well.
/<(body)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>\K[\S\s]*?(?=<\/\1\s*>)/
https://regex101.com/r/EkPkLb/1
Expanded
<
( body ) # (1)
(?:
\s+
(?>
" [\S\s]*? "
| ' [\S\s]*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
\s* >
\K
[\S\s]*?
(?= </ \1 \s* > )
Note that to really find a particular tag, you have to consume all the
previous tags via a (*SKIP)(?!), else your tag could be embedded inside
script literals, comments or invisible content.
I wouldn't worry too much about it.
If you're interested I could post a fairly large proper regex,
but I doubt you'd be interested.
Choosing the best pattern for your data depends on what kind of characters will be contained in your body tags. An additional consideration is whether you want to aim for efficiency or minimal memory.
These are some suitable (or not) patterns for your case:
93 steps   ~<body[^>]*>\K.*(?=</body>)~           # no capture group, no newline matches
105 steps  ~<body[^>]*>\K[\S\s]*?(?=</body>)~     # no capture group, newline matches
87 steps   ~<body[^>]*>(.*)</body>~               # capture group, no newline matches
96 steps   ~<body[^>]*>([\S\s]*?)</body>~         # capture group, newline matches
Here is a Pattern Demo with three samples to show the impact of newline characters in your body text.

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?
You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match and capture both with parentheses. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
my $var = $1;
my $val = $2;
}
The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
You are surrounding your match with .*\s+. This is unlikely to do what you think. You never need .* with m// unless you are capturing a string (or capturing the whole match with $&). A match is not anchored by default and can succeed anywhere in the string. To anchor it you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence called word boundary, \b. This is a zero-length assertion, that checks that the characters around it are word and non-word.
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, \s+ fails at the start or end of the string. Note that \b also matches next to non-word characters that can appear inside words, such as the hyphen here:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
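To make the difference between the two patterns concrete, here is a small sketch that applies both to one line. The sample line and the array names are made up for the example.

```perl
use strict;
use warnings;

my $line = 'aaa-xxx = 1';

# \b treats the hyphen as a boundary, so both halves of "aaa-xxx" match.
my @by_boundary = $line =~ /\b(\w+)\b/g;

# The whitespace lookarounds only accept tokens delimited by
# whitespace or by the start/end of the string.
my @by_whitespace = $line =~ /(?<!\S)(\w+)(?!\S)/g;

print "boundary:   @by_boundary\n";
print "whitespace: @by_whitespace\n";
```

Here the boundary version finds aaa, xxx and 1, while the whitespace version finds only 1, because aaa-xxx as a whole is not made up purely of word characters.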
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions

Perl regex issue with brackets where content are multiline

I have a string in a file, which is to be read by Perl, and can either be:
previous content ending with a linebreak
keyword: content
next content
or
previous content, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines, closed by matching parenthesis}
next content
In either case, I successfully loaded the contents, from the beginning of previous content, till the end of next, in a string, call it $str.
Now, I want to extract the stuff between the linebreak that ends previous content, and the linebreak before next content.
So I used a regex on $str like this:
if($str =~
/.*\nkeyword: # keyword: is always constant, immediately after a newline
(?!\{+) # NO { follows
\s+(?!\{+) # NO { with a heading whitespace
\s* # white space between keyword: and content
(?!\{+) # no { immediately before content
# question : should the last one be a negative lookbehind AFTER the check for content itself?
([^\s]+) # the content, should be in $1;
(?!\{+) # no trailing { immediately after content
\s+ # delimited by a whitespace, ignore what comes afterwards
| # or
/.*\nkeyword: # keyword: is always constant, immediately after a newline
(?=\s*{*\s*)*) # any mix of whitespace and {
(?=\{+) # at least one {
(?=\s*{*\s*)*) # again any mix of whitespace and {
([^\{\}]+) # no { or }
(?=\s*}*\s*)*) # any mix of whitespace and }
(?=\}+) # at least one }
(?=\s*}*\s*)*) # again any mix of whitespace and }
) { #do something with $1}
I realize that this one is not really addressing multiline information with nested parenthesis; however, it should capture objects in form keyword: {{ content} }
However, while I am able to capture the content in $1 in case of
keyword: content
form, I am unable to capture
keyword: {multiline with nested
{parenthesis} }
I finally did implement it using a simple counter based parser, instead of regex. I would love to know how can I do this in regex, to capture objects of the second form, with an explanation of the regex command, please.
Also, where did my formulation go wrong that it does not even capture single line content with multiple (but matched) heading and trailing parenthesis?
You can use this:
#!/usr/bin/perl
use strict;
use warnings;
my $str = "previous content ending with a linebreak
keyword: content
next content
previous content, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines, closed by matching parenthesis}
next content";
while ($str =~ /\nkeyword:
(?| # branch reset: i.e. the two capture groups have the same number
\s*
({ (?> [^{}]++ | (?1) )*+ }) # recursive pattern
| # OR
\h*
(.*+) # capture all until the end of line
) # close the branch reset group
/xg ) {
print "$1\n";
}
This pattern first tries to match content with nested curly brackets. If curly brackets are not found or are not balanced, the second alternative is tried, and it matches only the rest of the line (since the dot can't match newlines).
The branch reset feature (?|..|..) is useful to give the same number to the capturing group of each part of the alternation.
recursive pattern details:
( # open the capturing group 1
{ # literal opening curly bracket
(?> # atomic group: possible content between brackets
[^{}]++ # all that is not a curly bracket
| # OR
(?1) # recurse to the capturing group 1 (!here is the recursion!)
)*+ # repeat the atomic group zero or more times
} # literal closing curly bracket
) # close the capturing group 1
In this subpattern I use an atomic group (?>...) and the possessive quantifiers ++ and *+ to avoid backtracking as much as possible.
How about something like this?
if ($str =~ /keyword:\s*{(.*)}/s) {
my $key = $1;
if ($key =~ /([^{}]*)/) {
print "$1\n";
}
else {
print "$key\n";
}
}
elsif ($str =~ /keyword:\s*(.*)/) {
print "$1\n";
}
[^{}]* matches a run of characters with no braces in it. Note that, being unanchored, it matches the first such run in $key, i.e. the text before any nested brace.
The s modifier allows you to look at multiple lines even when using .*. However, you don't want to look at multiple lines for keywords without braces, so that part is in the elsif statement.
Do you need to have the same number of matching braces? For example, should keyword: {foo{bar{hello}}} output {{{hello}}}? If so, I feel like it would be better to stick with counters.
Edit:
For the input
keyword: {multiline
with nested {parenthesis} }
if you want the output
{multiline with nested {parenthesis} }
I believe that would be
if ($str =~ /keyword:\s*({.*})/s) {
my $match = $1;
$match =~ s/\n//g;
print "$match\n";
}
elsif ($str =~ /keyword:\s*(.*)/) {
print "$1\n";
}
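Since both the question and this answer mention counters as the alternative, here is a minimal sketch of a counter-based extractor. The helper name balanced_block and the sample string are made up for illustration. This is not the asker's actual parser, which was not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the balanced {...} block that starts at
# offset $start, or nothing if the braces never balance.
sub balanced_block {
    my ($text, $start) = @_;
    my $depth = 0;
    for my $i ($start .. length($text) - 1) {
        my $char = substr($text, $i, 1);
        $depth++ if $char eq '{';
        $depth-- if $char eq '}';
        return substr($text, $start, $i - $start + 1) if $depth == 0;
    }
    return;
}

my $str = "keyword: {multiline\nwith nested {parenthesis} }\nnext content";

my $block;
# The lookahead guarantees the offset in $+[0] points at an opening brace.
if ($str =~ /keyword:\s*(?=\{)/) {
    $block = balanced_block($str, $+[0]);
    print "$block\n" if defined $block;
}
```

The regex is only used to find where the block starts; the depth counter does the matching of braces.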

Regex: match everything before image

url1: /dir-images/no1/top-left.gif
url2: /test-1/test-2/test
I want to match the path before the last slash if it is an image file(url1), aka /dir-images/no1/
and match the whole path if it is not(url2), /test-1/test-2/test
tried ^([\=\/\.\w-]+\/)+ this could get path before the last slash no matter what is after it..
Try:
^([\=/.\w-]+/)+((?!.*\.gif$).*|)
The part with (?!...) is a lookahead, which works something like an if statement: it tests what follows without consuming it. There are two kinds, (?=...) and (?!...). The first is a positive lookahead (a normal "if"), the second a negative lookahead (an "if not").
In your case, I just ask whether the ending is not .gif, and if so I match everything that remains.
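As a rough check of this pattern against the two URLs from the question, the sketch below wraps it in one outer capture group and makes the inner groups non-capturing. That wrapping, and the @paths array, are assumptions added for the example.

```perl
use strict;
use warnings;

my @paths;
for my $url ('/dir-images/no1/top-left.gif', '/test-1/test-2/test') {
    # Directory part first; the remainder is added only when the
    # negative lookahead confirms the path does not end in .gif.
    if ($url =~ m{^((?:[=/.\w-]+/)+(?:(?!.*\.gif$).*|))}) {
        push @paths, $1;
    }
}
print "$_\n" for @paths;
```

The empty second alternative is what lets the match stop at the last slash for the image URL.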
One way (with perl flavour):
m|\A(.*/(?(?!.*\.gif$).*))|
Explanation:
m| ... | # Regexp.
\A # Beginning of string.
( # Group 1.
.*/ # All characters until last slash.
(? # Conditional expression.
(?!.*\.gif$) # If line doesn't end with '.gif', match...
.*) # ... until end of line.
)
Testing...
Content of script.pl:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[%s\n], $1 if m|\A(.*/(?(?!.*\.gif$).*))|;
}
__DATA__
/dir-images/no1/top-left.gif
/test-1/test-2/test
Run it like:
perl script.pl
which gives the following result:
/dir-images/no1/
/test-1/test-2/test