Perl ignore whitespace on replacement side of regular expression substitution

Suppose I have $str = "onetwo".
I would like to write a regex substitution command that ignores whitespace (which makes it more readable):
$str =~ s/
one
two
/
three
four
/x
Instead of "threefour", this produces "\nthree\nfour\n" (where \n is a newline). Basically the /x option ignores whitespace for the matching side of the substitution but not the replacement side. How can I ignore whitespace on the replacement side as well?

s{...}{...} is basically s{...}{qq{...}}e. If you don't want qq{...}, you'll need to replace it with something else.
s/
one
two
/
'three' .
'four'
/ex
Or even:
s/
one
two
/
clean('
three
four
')
/ex
A possible implementation of clean:
sub clean {
my ($s) = @_;
$s =~ s/^[ \t]+//mg;
$s =~ s/^\s+//;
$s =~ s/\s+\z//;
return $s;
}
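As a quick check of the two approaches above, here is a minimal sketch that runs both against "onetwo". The variable names $concat and $cleaned are only illustrative. Note that clean, as written, trims each line and the outer whitespace but keeps the newline between three and four, so only the concatenation form yields "threefour".

```perl
use strict;
use warnings;

# Same helper as in the answer: strip indentation on every line,
# then leading and trailing whitespace of the whole string.
sub clean {
    my ($s) = @_;
    $s =~ s/^[ \t]+//mg;
    $s =~ s/^\s+//;
    $s =~ s/\s+\z//;
    return $s;
}

# /e evaluates the replacement as Perl code, so the line breaks around
# the concatenated literals are code layout rather than output.
(my $concat = "onetwo") =~ s/
    one
    two
/
    'three' .
    'four'
/ex;

# Here the replacement code is a call to clean() on a multi-line literal.
(my $cleaned = "onetwo") =~ s/
    one
    two
/
    clean('
        three
        four
    ')
/ex;

print "$concat\n";     # threefour
print "$cleaned\n";    # "three" and "four" on separate lines
```

If the interior newline is unwanted too, the helper would need something like an extra s/\s+//g, at the cost of also removing spaces you might have meant to keep.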

Related

Perl regular expression that only keeps characters until first newline

When I try with a regular character instead of a newline such as X, the following works:
my $str = 'fooXbarXbaz';
$str =~ s/X.*//;
print $str;
→ foo
However, when the character to look for is \n, the code above fails:
my $str = "foo\nbar\nbaz";
$str =~ s/\n.*//;
print $str;
→ foo
baz
Why is this happening? It looks like . only matches b, a, and r, but not the second \n and the rest of the string.
The . metacharacter does not match a newline unless you add the /s switch. This way it matches everything after the newline even if there is another newline.
$str =~ s/\n.*//s;
Another way to do this is to match a character class that matches everything, such as digits and non-digits:
$str =~ s/\n[\d\D]*//;
Or, an alternation of a newline and a non-newline (\N cannot be used inside a bracketed character class):
$str =~ s/\n(?:\n|\N)*//;
I'd be tempted to do this with simpler string operations. Split into lines but only save the first one:
$str = ( split /\n/, $str, 2 )[0];
Or, a substring up to the first newline (this assumes the string contains one; index returns -1 otherwise):
$str = substr $str, 0, index($str, "\n");
And, I'm not recommending this, but sometimes I'll open a file handle on a reference to a scalar so I can read its contents line by line:
open my $string_fh, '<', \ $str;
my $line = <$string_fh>;
close $string_fh;
say $line; # say needs "use feature 'say';" (Perl 5.10 or newer)
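To compare the options side by side, here is a small sketch that applies them to the same input. The variable names are made up for the example, and the guard around substr is my addition for strings without a newline.

```perl
use strict;
use warnings;

my $original = "foo\nbar\nbaz";

# Without /s, "." stops at the next newline, so only "\nbar" is removed.
(my $no_s = $original) =~ s/\n.*//;

# With /s, "." also matches newlines, so everything after "foo" goes.
(my $with_s = $original) =~ s/\n.*//s;

# A limit of 2 splits at the first newline only; keep the first piece.
my $with_split = ( split /\n/, $original, 2 )[0];

# index returns -1 when there is no newline, so check before substr.
my $pos = index $original, "\n";
my $with_substr = $pos >= 0 ? substr($original, 0, $pos) : $original;

print "without /s: ", ($no_s eq "foo\nbaz" ? "foo, then baz on the next line" : "unexpected"), "\n";
print "with /s:    $with_s\n";
print "split:      $with_split\n";
print "substr:     $with_substr\n";
```

The last three all print foo; only the version without /s leaves baz behind.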

Perl - Combine multiple regexps without renumbering?

I need to combine multiple regexps into one, so code which looks like this:
my $s = "jump 0xbdf3487";
#my $s = "move 0xbdf3487";
if ($s =~ m/^(move) ([^ ]+)/) { print "matched '$1' '$2'\n"; }
if ($s =~ m/^(jump) ([^ ]+)/) { print "matched '$1' '$2'\n"; }
if ($s =~ m/^(call) ([^ ]+)/) { print "matched '$1' '$2'\n"; }
becomes:
my $s = "jump 0xbdf3487";
#my $s = "move 0xbdf3487";
my @patterns = (
'^(move) ([^ ]+)',
'^(jump) ([^ ]+)',
'^(call) ([^ ]+)'
);
my $re = "(?:" . join("|", @patterns) . ")";
$re = qr/$re/;
if ($s =~ m/$re/) { print "matched '$1' '$2'\n"; }
This doesn't work however, if $s is a jump we get:
matched '' ''
Matches in the combined regexp get renumbered:
($1, $2) become ($3, $4) in the jump regexp, ($5, $6) in the call one etc..
How do I combine these without renumbering ?
Use the branch reset pattern (?|pattern) (you'll need Perl 5.10 or newer though). Quoting the documentation (perldoc perlre):
This is the "branch reset" pattern, which has the special property that the capture groups are numbered from the same starting point in each alternation branch.
Your code becomes:
use strict;
use warnings;
my $s = "jump 0xbdf3487";
#my $s = "move 0xbdf3487";
my @patterns = (
'(move) ([^ ]+)',
'(jump) ([^ ]+)',
'(call) ([^ ]+)'
);
my $re = "^(?|" . join("|", @patterns) . ")";
$re = qr/$re/;
if ($s =~ m/$re/) { print "matched '$1' '$2'\n"; }
Note that I've added use strict and use warnings; don't forget them!
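To make the renumbering visible, here is a short sketch that matches the same string with an ordinary non-capturing group and with a branch reset group. The arrays @plain and @reset are just names chosen for this illustration.

```perl
use strict;
use warnings;

my $s = "jump 0xbdf3487";
my @patterns = ( '(move) ([^ ]+)', '(jump) ([^ ]+)', '(call) ([^ ]+)' );
my $alternatives = join '|', @patterns;

my ( @plain, @reset );

# Ordinary group: the six capture groups are numbered 1..6 from left
# to right, so the "jump" branch fills $3 and $4.
if ( $s =~ /^(?:$alternatives)/ ) {
    @plain = ( $3, $4 );
}

# Branch reset: numbering restarts in every branch, so whichever
# branch matches, the results are in $1 and $2.
if ( $s =~ /^(?|$alternatives)/ ) {
    @reset = ( $1, $2 );
}

print "plain group:  '$plain[0]' '$plain[1]'\n";
print "branch reset: '$reset[0]' '$reset[1]'\n";
```

Both lines print 'jump' '0xbdf3487', but only the branch reset version lets the calling code stay the same for every branch.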
You can use simple alternation in your regex and use just a single regex:
m/^(move|jump|call) ([^ ]+)/
Code:
my $s = "jump 0xbdf3487";
if ($s =~ m/^(move|jump|call) ([^ ]+)/) {
print "matched '$1' '$2'\n";
}
Perl Regex subpatterns can be joined together with pipes to make them alternating patterns. To separate alternating patterns from the rest of the expression pattern, delimit them as a group. If you don't want to capture what was matched by the group, make it a non-capturing group.
For example, alternation in a capturing group within a pattern:
(move|jump|call) ([^ ]+)
And alternation in a non-capturing group within a pattern:
(?:move|jump|call) ([^ ]+)
If your alternative patterns are complicated and you don't want them all on one line, you can use the /x modifier to separate them with whitespace.
Perldoc PerlRe Modifiers (scroll down to "Details on some modifiers")
/x
/x tells the regular expression parser to ignore most whitespace that
is neither backslashed nor within a bracketed character class. You can
use this to break up your regular expression into (slightly) more
readable parts. Also, the "#" character is treated as a metacharacter
introducing a comment that runs up to the pattern's closing delimiter,
or to the end of the current line if the pattern extends onto the next
line. Hence, this is very much like an ordinary Perl code comment.
(You can include the closing delimiter within the comment only if you
precede it with a backslash, so be careful!)
Use of /x means that if you want real whitespace or "#" characters in
the pattern (outside a bracketed character class, which is unaffected
by /x), then you'll either have to escape them (using backslashes or
\Q...\E ) or encode them using octal, hex, or \N{} escapes. It is
ineffective to try to continue a comment onto the next line by
escaping the \n with a backslash or \Q.
And here's my example demonstrating that:
#!/usr/bin/perl
use strict;
use warnings;
my $s = "jump 0xbdf3487";
if ($s =~ /^(
move # first complicated pattern
|
jump # second complicated pattern
|
call # third complicated pattern
)\s([^\ ]+) /x) { # Note I had to escape the space
# with a backslash because of /x
print "matched '$1' '$2'\n";
}
Which outputs:
matched 'jump' '0xbdf3487'

Perl regex multiline match without dot

There are numerous questions on how to do a multiline regex in Perl. Most of them mention the s switch that makes a dot match a newline. However, I want to match an exact phrase (so, not a pattern) and I don't know where the newlines will be. So the question is: can you ignore newlines, instead of matching them with .?
MWE:
$pattern = "Match this exact phrase across newlines";
$text1 = "Match\nthis exact\nphrase across newlines";
$text2 = "Match this\nexact phra\nse across\nnewlines";
$text3 = "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match";
$text1 =~ s/$pattern/replacement text/s;
$text2 =~ s/$pattern/replacement text/s;
$text3 =~ s/$pattern/replacement text/s;
print "$text1\n---\n$text2\n---\n$text3\n";
I can put dots in the pattern instead of spaces ("Match.this.exact.phrase") but that does not work for the second example. I can delete all newlines as preprocessing but I would like to keep newlines that are not part of the match (as in the third example).
Desired output:
replacement text
---
replacement text
---
Keep any newlines
replacement text
outside
of the match
Just replace the literal spaces with a character class that matches a space or a newline:
$pattern = "Match[ \n]this[ \n]exact[ \n]phrase[ \n]across[ \n]newlines";
Or, if you want to be more lenient, use \s or \s+ instead, since \s also matches newlines.
Most of the time, you are treating newlines as spaces. If that's all you wanted to do, all you'd need is
$text =~ s/\n/ /g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Then there's the one time you want to ignore it. If that's all you wanted to do, all you'd need is
$text =~ s/\n//g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Doing both is next to impossible if you have a regex pattern to match. But you seem to want to match literal text, so that opens up some possibilities.
( my $pattern = $text_to_find )
=~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
$pattern =~ s/^\\n\?//;
$text =~ /$pattern/;
It sounds like you want to change your "exact" pattern to match newlines anywhere, and also to allow newlines instead of spaces. So change your pattern to do so:
$pattern = "Match this exact phrase across newlines";
$pattern =~ s/\S\K\B/\n?/g;
$pattern =~ s/ /[ \n]/g;
It certainly is ugly, but it works:
M\n?a\n?t\n?c\n?h\st\n?h\n?i\n?s\se\n?x\n?a\n?c\n?t\sp\n?h\n?r\n?a\n?s\n?e\sa\n?c\n?r\n?o\n?s\n?s\sn\n?e\n?w\n?l\n?i\n?n\n?e\n?s
For every pair of letters inside a word, allow a newline between them with \n?. And replace each space in your regex with \s.
May not be usable, but it gets the job done ;)
Check it out at regex101.
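For literal phrases, the pattern-building idea from the answers above can be wrapped in a helper and run against the three sample texts from the question. This is only a sketch: phrase_to_regex is a hypothetical name, and it assumes a space in the phrase may appear as a space or a newline and that one stray newline may precede any other character.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a literal phrase into a regex that accepts
# a newline in place of a space and one newline before any other character.
sub phrase_to_regex {
    my ($phrase) = @_;
    ( my $pattern = $phrase )
        =~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
    $pattern =~ s/^\\n\?//;    # no optional newline before the first character
    return qr/$pattern/;
}

my $re = phrase_to_regex("Match this exact phrase across newlines");

my @texts = (
    "Match\nthis exact\nphrase across newlines",
    "Match this\nexact phra\nse across\nnewlines",
    "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match",
);

# Each element is aliased by the loop, so the array is edited in place.
s/$re/replacement text/ for @texts;

print join( "\n---\n", @texts ), "\n";
```

Newlines outside the matched phrase are left alone, which is what the third sample needs.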

How to replace stuff in a Perl regex

I have a string $text and want to modify it with a regex. The string contains multiple sections like <NAME>John</NAME>.
I want to search for those sections, which I would normally do with something like
$text =~ m/<NAME>(.*?)<\/NAME>/g
but then make sure that there are no leading and trailing blanks and no leading non-word characters, which I would normally ensure with something like
$temp =~ s/^\s+|\s+$//g; # trim leading and trailing whitespaces
$temp =~ s/^\W*//g; # remove all leading non-word chars
Now my question is: How do I actually make this happen? Is it possible to use a s/// regex instead of the m//?
This is possible in a single substitution, but it's unnecessarily complex. I suggest you do a two-tier substitution using an executable replacement.
my $text = '<NAME> %^John^%
</NAME>';
$text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{
(my $new = $1) =~ s/\A\s+|\s+\z//g;
$new =~ s/\A\W+//;
$new;
}egx;
print $text;
output
<NAME>John^%</NAME>
This is even simpler if you have version 14 or later of Perl 5, and want to use the non-destructive ( /r modifier) substitution mode.
$text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{ $1 =~ s/\A\s+|\s+\z//gr =~ s/\A\W+//r }exg;
If I understand correctly, what you want to do is merely "clean up" the text inside the tag (insofar as it's possible to "parse" XML using regular expressions). This should do the trick:
$text =~ s/(<NAME>)\s*\W*(.*?)\s*(<\/NAME>)/$1$2$3/sgi;
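Since the question mentions multiple sections, here is a sketch of the executable-replacement approach applied to a string with two of them. The helper name tidy_names and the sample input are invented for the example. The /x flag matters here because the pattern is written with spaces around its parts.

```perl
use strict;
use warnings;

# Hypothetical helper: for every <NAME>...</NAME> section, trim the
# content and drop leading non-word characters. The lookarounds keep
# the tags themselves out of the match, so only the content is replaced.
sub tidy_names {
    my ($text) = @_;
    $text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{
        my $name = $1;
        $name =~ s/\A\s+|\s+\z//g;
        $name =~ s/\A\W+//;
        $name;
    }egx;
    return $text;
}

my $tidied = tidy_names("<NAME>  John </NAME> and <NAME>--Jane</NAME>");
print "$tidied\n";    # <NAME>John</NAME> and <NAME>Jane</NAME>
```

Copying $1 into a lexical first avoids any surprise from the inner substitutions resetting the capture variables.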

Multiple substitutions with a single regular expression in perl

Say I have the following in perl:
my $string;
$string =~ s/ /\\ /g;
$string =~ s/'/\\'/g;
$string =~ s/`/\\`/g;
Can the above substitutions be performed with a single combined regular expression instead of 3 separate ones?
$string =~ s/([ '`])/\\$1/g;
Uses a character class [ '`] to match one of space, ' or ` and uses brackets () to remember the matched character. $1 is then used to include the remembered character in the replacement.
Separate substitutions may be much more efficient than a single complex one (e.g. when working with fixed substrings). In such cases you can make the code shorter, like this:
my $string;
for ($string) {
s/ /\\ /g;
s/'/\\'/g;
s/`/\\`/g;
}
Although it's arguably easier to read the way you have it now, you can perform these substitutions at once by using a loop, or combining them in one expression:
# loop
$string =~ s/$_/\\$_/g foreach (' ', "'", '`');
# combined
$string =~ s/([ '`])/\\$1/g;
By the way, you can make your substitutions a little easier to read by avoiding "leaning toothpick syndrome", as the various regex operators allow you to use a variety of delimiters:
$string =~ s{ }{\\ }g;
$string =~ s{'}{\\'}g;
$string =~ s{`}{\\`}g;
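To confirm that the combined character class and the loop give the same result, here is a small sketch with a made-up input string. The names $combined and $looped are illustrative.

```perl
use strict;
use warnings;

my $input = q{it's a `test`};

# One pass: capture any of the three characters and prefix a backslash.
( my $combined = $input ) =~ s/([ '`])/\\$1/g;

# Three passes, one per character. This is safe here because the
# inserted backslash is not itself one of the characters being escaped.
my $looped = $input;
$looped =~ s/$_/\\$_/g for ' ', "'", '`';

print "$combined\n";
print "$looped\n";
```

Both lines print it\'s\ a\ \`test\`. If a backslash were added to the set of characters to escape, the order of the separate passes would start to matter, which is one reason to prefer the single character class.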