How to replace stuff in a Perl regex - regex

I have a string $text and want to modify it with a regex. The string contains multiple sections like <NAME>John</NAME>.
I want to search for those sections, which I would normally do with something like
$text =~ m/<NAME>(.*?)<\/NAME>/g
but then make sure that there are no leading and trailing blanks and no leading non-word characters, which I would normally ensure with something like
$temp =~ s/^\s+|\s+$//g; # trim leading and trailing whitespaces
$temp = s/^\W*//g; # remove all leading non-word chars
Now my question is: How do I actually make this happen? Is it possible to use a s/// regex instead of the m//?

This is possible in a single substitution, but it's unnecessarily complex. I suggest you do a two-tier substitution using a executable replacement.
my $text = '<NAME> %^John^%
</NAME>';
$text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{
(my $new = $1) =~ s/\A\s+|\s+\z//g;
$new =~ s/\A\W+//;
$new;
}eg;
print $text;
output
<NAME>John^%</NAME>
This is even simpler if you have version 14 or later of Perl 5, and want to use the non-destructive ( /r modifier) substitution mode.
$text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{ $1 =~ s/\A\s+|\s+\z//gr =~ s/\A\W+//r }exg;

If I understand correctly, what you want to do is merely "clean up" the text inside the tag (insofar as it's possible to "parse" XML using regular expressions). This should do the trick:
$text =~ s/(<NAME>)\s*\W*(.*?)\s*(<\/NAME>)/$1$2$3/sgi;

Related

Limit the translation to just one word in a phrase?

Coming new to Perl world from Python, and wonder if there is a simple way to limit the translation or replace to just one word in a phrase?
In the example, the 2nd word kind also got changed to lind. Is there a simple way to do the translation without diving into some looping? Thanks.
The first word has been correctly translated to gazelle, but 2nd word has been changed too as you can see.
my $string = 'gazekke is one kind of antelope';
my $count = ($string =~ tr/k/l/);
print "There are $count changes \n";
print $string; # gazelle is one lind of antelope <-- kind becomes lind too!
I don't know of an option for tr to stop translation after the first word.
But you can use a regex with backreferences for this.
use strict;
my $string = 'gazekke is one kind of antelope';
# Match first word in $1 and rest of sentence in $2.
$string =~ m/(\w+)(.*)/;
# Translate all k's to l's in the first word.
(my $translated = $1) =~ tr/k/l/;
# Concatenate the translated first word with the rest
$string = "$translated$2";
print $string;
Outputs: gazelle is one kind of antelope
Pick the first match (a word in this case), precisely what regex does when without /g, and in that word replace all wanted characters, by running code in the replacement side, by /e
$string =~ s{(\w+)}{ $1 =~ s/k/l/gr }e;
In the regex in the replacement side, /r modifier makes it handily return the changed string and doesn't change the original, what also allows a substitution to run on $1 (which can't be modified as is a read-only).
tr is a character class transliterator. For anything else you would use regex.
$string =~ s/gazekke/gazelle/;
You can put a code block as the second half of s/// to do more complicated replacements or transmogrifications.
$string =~ s{([A-Za-z]+)}{ &mangler($1) if $should_be_mangled{$1}; }ge;
Edit:
Here's how you would first locate a phrase and then work on it.
$phrase_regex = qr/(?|(gazekke) is one kind of antelope|(etc))/;
$string =~ s{($phrase_regex)}{
my $match = $1;
my $word = $2;
$match =~ s{$word}{
my $new = $new_word_map{$word};
&additional_mangling($new);
$new;
}e;
$match;
}ge;
Here's the Perl regex documentation.
https://perldoc.perl.org/perlre

Perl regex multiline match without dot

There are numerous questions on how to do a multiline regex in Perl. Most of them mention the s switch that makes a dot match a newline. However, I want to match an exact phrase (so, not a pattern) and I don't know where the newlines will be. So the question is: can you ignore newlines, instead of matching them with .?
MWE:
$pattern = "Match this exact phrase across newlines";
$text1 = "Match\nthis exact\nphrase across newlines";
$text2 = "Match this\nexact phra\nse across\nnewlines";
$text3 = "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match";
$text1 =~ s/$pattern/replacement text/s;
$text2 =~ s/$pattern/replacement text/s;
$text3 =~ s/$pattern/replacement text/s;
print "$text1\n---\n$text2\n---\n$text3\n";
I can put dots in the pattern instead of spaces ("Match.this.exact.phrase") but that does not work for the second example. I can delete all newlines as preprocessing but I would like to keep newlines that are not part of the match (as in the third example).
Desired output:
replacement text
---
replacement text
---
Keep any newlines
replacement text
outside
of the match
Just replace the literal spaces with a character class that matches a space or a newline:
$pattern = "Match[ \n]this[ \n]exact[ \n]phrase[ \n]across[ \n]newlines";
Or, if you want to be more lenient, use \s or \s+ instead, since \s also matches newlines.
Most of the time, you are treating newlines as spaces. If that's all you wanted to do, all you'd need is
$text =~ s/\n/ /g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Then there's the one time you want to ignore it. If that's all you wanted to do, all you'd need is
$text =~ s/\n//g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Doing both is next to impossible if you have a regex pattern to match. But you seem to want to match literal text, so that opens up some possibilities.
( my $pattern = $text_to_find )
=~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
$pattern =~ s/^\\n\?//;
$text =~ /$pattern/
It sounds like you want to change your "exact" pattern to match newlines anywhere, and also to allow newlines instead of spaces. So change your pattern to do so:
$pattern = "Match this exact phrase across newlines";
$pattern =~ s/\S\K\B/\n?/g;
$pattern =~ s/ /[ \n]/g;
It certainly is ugly, but it works:
M\n?a\n?t\n?c\n?h\st\n?h\n?i\n?s\se\n?x\n?a\n?ct\sp\n?h\n?r\n?a\n?s\n?e\sa\n?c\n?r\n?o\n?s\n?s\sn\n?e\n?w\n?l\n?i\n?n\n?e\n?s
For every pair of letters inside a word, allow a newline between them with \n?. And replace each space in your regex with \s.
May not be usable, but it gets the job done ;)
Check it out at regex101.

How to match a question mark?

I am trying to search and replace a list of URLs in a file and I am having problems if the search URL has a question mark in it. The $file below is just a single tag here, but it is usually an entire file.
my $search = 'http://shorturl.com/detail.cfm?color=blue';
my $replace = 'http://shorturl.com/detaila.aspx?color=red';
my $file = 'HI';
$file =~ s/$search/$replace/gis;
print $file;
If the $search variable has ? in it the substitution does not work. It would work if I were to take off the ?color=blue from the $search variable.
Does anyone know how to make the above substitution work? Backslashing, i.e. \? did not help. Thanks.
Use quotemeta for the regex pattern.
use warnings;
use strict;
my $search = quotemeta 'http://shorturl.com/detail.cfm?color=blue';
my $replace = 'http://shorturl.com/detaila.aspx?color=red';
my $file = 'HI';
$file =~ s/$search/$replace/gis;
print $file;
__END__
HI
When a string is interpolated as a regex, it isn't matched literally, but interpreted as a regex. This is useful to build complex regexes, e.g.
my #animals = qw/ cat dog goldfish /;
my $animal_re = join "|", #animals;
say "The $thing is an animal" if $thing =~ /$animal_re/i;
In the string $animal_re, the | is treated as a regex metacharacter.
Other metacharacters are e.g. ., which matches any non-newline character, or ?, which makes the previous atom optional.
If you want to match the contents of a variable literally, you can enclose it in \Q...\E quotes:
s/\Q$search/$replace/gi
(The /s option just changes the meaning of . from “match any non-newline character” to “match any character”, and is therefore irrelevant here.)
The \Q...\E is syntactic sugar for the quotemeta function, therefore this answer and toolic's answer are exactly equivalent.
Please note that you want to escape more than just the ?. The ? is the only one in your example that messes up what you're expecting, but the . matching can be insidious to find.
The regex /foo.com/ will indeed match the string foo.com, but it will also match foo com and fooXcom and foo!com, because . matches any character. Therefore, the /foo.com/ should be written as /foo\.com/.

Perl ignore whitespace on replacement side of regular expression substitution

Suppose I have $str = "onetwo".
I would like to write a reg ex substitution command that ignores whitespace (which makes it more readable):
$str =~ s/
one
two
/
three
four
/x
Instead of "threefour", this produces "\nthree\nfour\n" (where \n is a newline). Basically the /x option ignores whitespace for the matching side of the substitution but not the replacement side. How can I ignore whitespace on the replacement side as well?
s{...}{...} is basically s{...}{qq{...}}e. If you don't want qq{...}, you'll need to replace it with something else.
s/
one
two
/
'three' .
'four'
/ex
Or even:
s/
one
two
/
clean('
three
four
')
/ex
A possible implementation of clean:
sub clean {
my ($s) = #_;
$s =~ s/^[ \t]+//mg;
$s =~ s/^\s+//;
$s =~ s/\s+\z//;
return $s;
}

Multiple substitutions with a single regular expression in perl

Say I have the following in perl:
my $string;
$string =~ s/ /\\ /g;
$string =~ s/'/\\'/g;
$string =~ s/`/\\`/g;
Can the above substitutions be performed with a single combined regular expression instead of 3 separate ones?
$string =~ s/([ '`])/\\$1/g;
Uses a character class [ '`] to match one of space, ' or ` and uses brackets () to remember the matched character. $1 is then used to include the remembered character in the replacement.
Separate substitutions may be much more efficient than a single complex one (e.g. when working with fixed substrings). In such cases you can make the code shorter, like this:
my $string;
for ($string) {
s/ /\\ /g;
s/'/\\'/g;
s/`/\\`/g;
}
Although it's arguably easier to read the way you have it now, you can perform these substitutions at once by using a loop, or combining them in one expression:
# loop
$string =~ s/$_/\\$_/g foreach (' ', "'", '`');
# combined
$string =~ s/([ '`])/\\$1/g;
By the way, you can make your substitutions a little easier to read by avoiding "leaning toothpick syndrome", as the various regex operators allow you to use a variety of delimiters:
$string =~ s{ }{\\ }g;
$string =~ s{'}{\\'}g;
$string =~ s{`}{\\`}g;