Perl - Combine multiple regexps without renumbering? - regex

I need to combine multiple regexps into one, so code which looks like this:
my $s = "jump 0xbdf3487";
#my $s = "move 0xbdf3487";
if ($s =~ m/^(move) ([^ ]+)/) { print "matched '$1' '$2'\n"; }
if ($s =~ m/^(jump) ([^ ]+)/) { print "matched '$1' '$2'\n"; }
if ($s =~ m/^(call) ([^ ]+)/) { print "matched '$1' '$2'\n"; }
becomes:
my $s = "jump 0xbdf3487";
#my $s = "move 0xbdf3487";
my @patterns = (
'^(move) ([^ ]+)',
'^(jump) ([^ ]+)',
'^(call) ([^ ]+)'
);
my $re = "(?:" . join("|", @patterns) . ")";
$re = qr/$re/;
if ($s =~ m/$re/) { print "matched '$1' '$2'\n"; }
This doesn't work, however. If $s is a jump we get:
matched '' ''
The capture groups in the combined regexp are numbered consecutively across the alternatives:
($1, $2) become ($3, $4) in the jump regexp, ($5, $6) in the call one, and so on.
How do I combine these without renumbering?

Use the branch reset pattern (?|pattern) (you'll need Perl 5.10 or newer though). Quoting the documentation (perldoc perlre):
This is the "branch reset" pattern, which has the special property that the capture groups are numbered from the same starting point in each alternation branch.
Your code becomes:
use strict;
use warnings;
my $s = "jump 0xbdf3487";
#my $s = "move 0xbdf3487";
my @patterns = (
'(move) ([^ ]+)',
'(jump) ([^ ]+)',
'(call) ([^ ]+)'
);
my $re = "^(?|" . join("|", @patterns) . ")";
$re = qr/$re/;
if ($s =~ m/$re/) { print "matched '$1' '$2'\n"; }
Note that I've added use strict and use warnings; don't forget them!
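To see the renumbering directly, here is a minimal sketch that builds both the ordinary alternation and the branch-reset version from the same sub-patterns and counts the capture groups each one returns in list context. The variable names ($plain, $reset, @plain_caps, @reset_caps) are illustrative and not part of the original answer.

```perl
use strict;
use warnings;

# Same three sub-patterns as in the answer above.
my @patterns = ('(move) ([^ ]+)', '(jump) ([^ ]+)', '(call) ([^ ]+)');

my $plain = '^(?:' . join('|', @patterns) . ')';   # ordinary alternation
my $reset = '^(?|' . join('|', @patterns) . ')';   # branch reset (Perl 5.10+)

my $s = 'jump 0xbdf3487';

# In list context a successful match returns one value per capture group,
# with undef for groups that did not take part in the match.
my @plain_caps = $s =~ /$plain/;
my @reset_caps = $s =~ /$reset/;

printf "plain: %d groups, reset: %d groups\n",
    scalar @plain_caps, scalar @reset_caps;
print "reset captures: @reset_caps\n";
```

With the ordinary alternation the jump captures land in the third and fourth slots; with branch reset they are in the first two, which is why $1 and $2 work again.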

You can use simple alternation in your regex and use just a single regex:
m/^(move|jump|call) ([^ ]+)/
Code:
my $s = "jump 0xbdf3487";
if ($s =~ m/^(move|jump|call) ([^ ]+)/) {
print "matched '$1' '$2'\n";
}
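If the list of keywords lives in an array, the same single regex can be assembled from it. This is only a sketch: the @mnemonics array and the extra test strings are assumptions for illustration, and quotemeta is used so that a keyword containing regex metacharacters would still be matched literally.

```perl
use strict;
use warnings;

my @mnemonics = qw(move jump call);   # hypothetical keyword list

# Escape each keyword, then join with | inside one capturing group.
my $alternation = join '|', map { quotemeta($_) } @mnemonics;
my $re = qr/^($alternation) ([^ ]+)/;

my @seen;
for my $s ('jump 0xbdf3487', 'move 0xbdf3487', 'halt 0x0') {
    if (my ($op, $arg) = $s =~ $re) {
        push @seen, "$op:$arg";
        print "matched '$op' '$arg'\n";
    }
    else {
        print "no match for '$s'\n";
    }
}
```

Because there is only one pair of capture groups, no renumbering can occur.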

Perl Regex subpatterns can be joined together with pipes to make them alternating patterns. To separate alternating patterns from the rest of the expression pattern, delimit them as a group. If you don't want to capture what was matched by the group, make it a non-capturing group.
For example, alternation in a capturing group within a pattern:
(move|jump|call) ([^ ]+)
And alternation in a non-capturing group within a pattern:
(?:move|jump|call) ([^ ]+)
If your alternative patterns are complicated and you don't want them all on one line, you can use the /x modifier to separate them with whitespace.
Perldoc PerlRe Modifiers (scroll down to "Details on some modifiers")
/x
/x tells the regular expression parser to ignore most whitespace that
is neither backslashed nor within a bracketed character class. You can
use this to break up your regular expression into (slightly) more
readable parts. Also, the "#" character is treated as a metacharacter
introducing a comment that runs up to the pattern's closing delimiter,
or to the end of the current line if the pattern extends onto the next
line. Hence, this is very much like an ordinary Perl code comment.
(You can include the closing delimiter within the comment only if you
precede it with a backslash, so be careful!)
Use of /x means that if you want real whitespace or "#" characters in
the pattern (outside a bracketed character class, which is unaffected
by /x), then you'll either have to escape them (using backslashes or
\Q...\E ) or encode them using octal, hex, or \N{} escapes. It is
ineffective to try to continue a comment onto the next line by
escaping the \n with a backslash or \Q.
And here's my example demonstrating that:
#!/usr/bin/perl
use strict;
use warnings;
my $s = "jump 0xbdf3487";
if ($s =~ /^(
move # first complicated pattern
|
jump # second complicated pattern
|
call # third complicated pattern
)\s([^\ ]+) /x) { # Note I had to match the space with \s (or escape it)
# because of /x; inside [...] the backslash is optional
print "matched '$1' '$2'\n";
}
Which outputs:
matched 'jump' '0xbdf3487'

Related

Perl Regex Remove Hyphen but Ignore Specific Hyphenated words

I have a Perl regex which converts hyphens to spaces, e.g.:
$string =~ s/-/ /g;
I need to modify this to ignore specific hyphenated phrases and not replace the hyphen e.g. in a string like this:
"use-either-dvi-d-or-dvi-i"
I wish to NOT replace the hyphen in dvi-d and dvi-i so it reads:
"use either dvi-d or dvi-i"
I have tried various negative look ahead matches but failed miserably.
You can use this PCRE regex with verbs (*SKIP)(*F) to skip certain words from your match:
dvi-[id](*SKIP)(*F)|-
This skips the words dvi-i and dvi-d, so their hyphens are never replaced, thanks to (*SKIP)(*F).
For your code:
$string =~ s/dvi-[id](*SKIP)(*F)|-/ /g;
There is an alternative lookaround-based solution as well:
/(?<!dvi)-|-(?![di])/
Which basically means match hyphen if it is not preceded by dvi OR if it is not followed by d or i, thus making sure to not match - when we have dvi on LHS and [di] on RHS.
Perl code:
$string =~ s/(?<!dvi)-|-(?![di])/ /g;
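As a quick check, the two substitutions from this answer can be run side by side on the sample string. This is a small sketch; the variable names $skip and $look are made up for the comparison.

```perl
use strict;
use warnings;

my $input = 'use-either-dvi-d-or-dvi-i';

# Variant 1: consume dvi-d / dvi-i, then force a failure so that those
# hyphens are skipped; every other hyphen is matched by the second branch.
(my $skip = $input) =~ s/dvi-[id](*SKIP)(*F)|-/ /g;

# Variant 2: a hyphen qualifies unless it has dvi before it AND d or i after it.
(my $look = $input) =~ s/(?<!dvi)-|-(?![di])/ /g;

print "$skip\n";
print "$look\n";
```

Both variants print the same line for this input.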
$string =~ s/(?<!dvi)-(?![id])|(?<=dvi)-(?![id])|(?<!dvi)-(?=[id])/ /g;
Using just (?<!dvi)-(?![id]) would also leave the hyphen untouched in dvi-x or x-i, where x can be any character.
It is unlikely that you could get a simple and straightforward regex solution to this. However, you could try the following:
#!/usr/bin/env perl
use strict;
use warnings;
my %whitelist = map { $_ => 1 } qw( dvi-d dvi-i );
my $string = 'use-either-dvi-d-or-dvi-i';
while ( $string =~ m{ ( [^-]+ ) ( - ) ( [^-]+ ) }gx ) {
my $segment = substr($string, $-[0], $+[0] - $-[0]);
unless ( $whitelist{ $segment } ) {
substr( $string, $-[2], 1, ' ');
}
pos( $string ) = $-[ 3 ];
}
print $string, "\n";
The @- array contains the starting offsets of matched groups, and the @+ array contains the end offsets. In both cases, element 0 refers to the whole match.
I had to resort to something like this because of how \G works:
Note also that s/// will refuse to overwrite part of a substitution that has already been replaced; so for example this will stop after the first iteration, rather than iterating its way backwards through the string:
$_ = "123456789";
pos = 6;
s/.(?=.\G)/X/g;
print; # prints 1234X6789, not XXXXX6789
Maybe @tchrist can figure out how to bend various assertions to his will.
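For readers unfamiliar with the @- and @+ arrays used in the loop above, here is a minimal sketch on a shortened input. The string 'use-either' and the variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $string = 'use-either';
my ($whole, $hyphen, $third) = ('', '', '');

if ($string =~ /([^-]+)(-)([^-]+)/) {
    # $-[N] is the offset where group N starts; $+[N] is the offset just
    # past its end. Index 0 describes the whole match.
    $whole  = sprintf '%d..%d', $-[0], $+[0];
    $hyphen = sprintf '%d..%d', $-[2], $+[2];

    # The offsets can be passed to substr to recover a group's text.
    $third = substr($string, $-[3], $+[3] - $-[3]);

    print "whole=$whole hyphen=$hyphen third=$third\n";
}
```

The hyphen's start offset is what the answer's loop hands to the four-argument substr to overwrite that single character in place.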
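We can ignore specific words using negative look-ahead and negative look-behind assertions.
Example:
(?!pattern)
is a negative look-ahead assertion, and (?<!pattern) is a negative look-behind assertion.
In your case the substitution is
$string =~ s/(?<!dvi)-(?<![id])/ /g;
Note that the trailing (?<![id]) looks back at the hyphen itself, so it always succeeds; the pattern behaves like (?<!dvi)- and replaces every hyphen not preceded by dvi.
Output:
use either dvi-d or dvi-i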
Reference : http://www.perlmonks.org/?node_id=518444
Hope this will help you.

Perl regex multiline match without dot

There are numerous questions on how to do a multiline regex in Perl. Most of them mention the s switch that makes a dot match a newline. However, I want to match an exact phrase (so, not a pattern) and I don't know where the newlines will be. So the question is: can you ignore newlines, instead of matching them with .?
MWE:
$pattern = "Match this exact phrase across newlines";
$text1 = "Match\nthis exact\nphrase across newlines";
$text2 = "Match this\nexact phra\nse across\nnewlines";
$text3 = "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match";
$text1 =~ s/$pattern/replacement text/s;
$text2 =~ s/$pattern/replacement text/s;
$text3 =~ s/$pattern/replacement text/s;
print "$text1\n---\n$text2\n---\n$text3\n";
I can put dots in the pattern instead of spaces ("Match.this.exact.phrase") but that does not work for the second example. I can delete all newlines as preprocessing but I would like to keep newlines that are not part of the match (as in the third example).
Desired output:
replacement text
---
replacement text
---
Keep any newlines
replacement text
outside
of the match
Just replace the literal spaces with a character class that matches a space or a newline:
$pattern = "Match[ \n]this[ \n]exact[ \n]phrase[ \n]across[ \n]newlines";
Or, if you want to be more lenient, use \s or \s+ instead, since \s also matches newlines.
Most of the time, you are treating newlines as spaces. If that's all you wanted to do, all you'd need is
$text =~ s/\n/ /g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Then there's the one time you want to ignore it. If that's all you wanted to do, all you'd need is
$text =~ s/\n//g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Doing both is next to impossible if you have a regex pattern to match. But you seem to want to match literal text, so that opens up some possibilities.
( my $pattern = $text_to_find )
=~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
$pattern =~ s/^\\n\?//;
$text =~ /$pattern/
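Putting this answer's pattern builder together with the three sample texts from the question gives a runnable sketch. The @texts and @results arrays and the use of single-quoted replacement fragments are my own arrangement, not the answerer's exact code.

```perl
use strict;
use warnings;

my $text_to_find = 'Match this exact phrase across newlines';

# Space -> "[ \n]"; any other character -> optional newline, then the
# quoted character. Single quotes keep \n as regex text, not a newline.
(my $pattern = $text_to_find)
    =~ s/(.)/ $1 eq ' ' ? '[ \n]' : '\n?' . quotemeta($1) /seg;
$pattern =~ s/^\\n\?//;   # no optional newline before the first character

my @texts = (
    "Match\nthis exact\nphrase across newlines",
    "Match this\nexact phra\nse across\nnewlines",
    "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match",
);

my @results;
for my $text (@texts) {
    (my $copy = $text) =~ s/$pattern/replacement text/;
    push @results, $copy;
}
print join("\n---\n", @results), "\n";
```

The newlines outside the matched phrase in the third text are left alone, because only the phrase itself is replaced.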
It sounds like you want to change your "exact" pattern to match newlines anywhere, and also to allow newlines instead of spaces. So change your pattern to do so:
$pattern = "Match this exact phrase across newlines";
$pattern =~ s/\S\K\B/\n?/g;
$pattern =~ s/ /[ \n]/g;
It certainly is ugly, but it works:
M\n?a\n?t\n?c\n?h\st\n?h\n?i\n?s\se\n?x\n?a\n?c\n?t\sp\n?h\n?r\n?a\n?s\n?e\sa\n?c\n?r\n?o\n?s\n?s\sn\n?e\n?w\n?l\n?i\n?n\n?e\n?s
For every pair of letters inside a word, allow a newline between them with \n?. And replace each space in your regex with \s.
May not be usable, but it gets the job done ;)

How to replace stuff in a Perl regex

I have a string $text and want to modify it with a regex. The string contains multiple sections like <NAME>John</NAME>.
I want to search for those sections, which I would normally do with something like
$text =~ m/<NAME>(.*?)<\/NAME>/g
but then make sure that there are no leading and trailing blanks and no leading non-word characters, which I would normally ensure with something like
$temp =~ s/^\s+|\s+$//g; # trim leading and trailing whitespaces
$temp =~ s/^\W*//g; # remove all leading non-word chars
Now my question is: How do I actually make this happen? Is it possible to use a s/// regex instead of the m//?
This is possible in a single substitution, but it's unnecessarily complex. I suggest you do a two-tier substitution using an executable replacement.
my $text = '<NAME> %^John^%
</NAME>';
$text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{
(my $new = $1) =~ s/\A\s+|\s+\z//g;
$new =~ s/\A\W+//;
$new;
}egx;
print $text;
output
<NAME>John^%</NAME>
This is even simpler if you have version 14 or later of Perl 5, and want to use the non-destructive ( /r modifier) substitution mode.
$text =~ s{ (?<=<NAME>) ([^<>]*) (?=</NAME>) }{ $1 =~ s/\A\s+|\s+\z//gr =~ s/\A\W+//r }exg;
If I understand correctly, what you want to do is merely "clean up" the text inside the tag (insofar as it's possible to "parse" XML using regular expressions). This should do the trick:
$text =~ s/(<NAME>)\s*\W*(.*?)\s*(<\/NAME>)/$1$2$3/sgi;

Perl Regex OR condition

I have a variable which is like $var = 1.2.3 or $var = Variable/1.2.3. I am trying to write a regex that matches and stores in $1. The code goes as follows:
if ($var =~ m/[\w+\/\d+.\d+.\d+])/){
$a = $1;
}
I want the match to succeed for either form of $var. Any suggestions? Thank you.
Parentheses capture. It is error-prone to rely on $1; simply use the return value from the match operator.
for my $var ('1.2.3', 'Variable/1.2.3') {
if (my ($version) = $var =~ m{
(?:\A | /) # beginning of string or a slash
(\d+ [.] \d+ [.] \d+) # capture version number triple
\z # end of string
}msx) {
print ">>> $version <<<\n";
}
}
__END__
>>> 1.2.3 <<<
>>> 1.2.3 <<<
You should remove that character class. A character class matches just a single character; it doesn't represent a sequence. Also, . is a metacharacter in regex. To match it literally, you need to escape it.
And then you need to make the part before the / optional, using the ? quantifier, as it is not necessarily present in the string:
if ($var =~ m/((?:\w+\/)?\d+\.\d+\.\d+)/){
$a = $1;
}
Just FYI, you can use any delimiter for the match operator, so as to avoid escaping /:
m!((?:\w+/)?\d+\.\d+\.\d+)!
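A short sketch of that pattern applied to both forms of the input, plus one string without a version number as a negative case. The third test string and the @captured array are additions for illustration.

```perl
use strict;
use warnings;

my @captured;
for my $var ('1.2.3', 'Variable/1.2.3', 'no version here') {
    # In list context the match returns the captured text directly,
    # so there is no need to read $1 afterwards.
    if (my ($match) = $var =~ m!((?:\w+/)?\d+\.\d+\.\d+)!) {
        push @captured, $match;
        print "captured: $match\n";
    }
}
```

The optional group (?:\w+/)? lets the same regex capture the bare version and the prefixed one.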
Try this:
if ($var =~ m/(?>[a-z_][a-z0-9_]*+\/)?+[0-9]++\.[0-9]++\.[0-9]++/i) {
$a = $&;
}

Perl ignore whitespace on replacement side of regular expression substitution

Suppose I have $str = "onetwo".
I would like to write a reg ex substitution command that ignores whitespace (which makes it more readable):
$str =~ s/
one
two
/
three
four
/x
Instead of "threefour", this produces "\nthree\nfour\n" (where \n is a newline). Basically the /x option ignores whitespace for the matching side of the substitution but not the replacement side. How can I ignore whitespace on the replacement side as well?
s{...}{...} is basically s{...}{qq{...}}e. If you don't want qq{...}, you'll need to replace it with something else.
s/
one
two
/
'three' .
'four'
/ex
Or even:
s/
one
two
/
clean('
three
four
')
/ex
A possible implementation of clean (it strips indentation and surrounding whitespace but keeps the newlines between lines):
sub clean {
my ($s) = @_;
$s =~ s/^[ \t]+//mg;
$s =~ s/^\s+//;
$s =~ s/\s+\z//;
return $s;
}
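The two /ex variants above can be compared in one script. This is a sketch using the clean implementation as given; the variable names $concat and $cleaned are made up for the comparison.

```perl
use strict;
use warnings;

# clean() as given in the answer above.
sub clean {
    my ($s) = @_;
    $s =~ s/^[ \t]+//mg;   # drop indentation at the start of every line
    $s =~ s/^\s+//;        # drop leading blank lines
    $s =~ s/\s+\z//;       # drop trailing whitespace
    return $s;
}

# Variant 1: the replacement is code (/e), so its layout is free.
my $concat = 'onetwo';
$concat =~ s/
    one
    two
/
    'three' .
    'four'
/ex;

# Variant 2: pass an indented multi-line string through clean().
my $cleaned = 'onetwo';
$cleaned =~ s/
    one
    two
/
    clean('
        three
        four
    ')
/ex;

print "$concat\n";
print "$cleaned\n";
```

The concatenation variant yields threefour, while clean as written keeps the line break between three and four; a helper that removed every whitespace character would be needed to get threefour from the second form.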