RegEx for capturing multiple groups between two words

Consider the following strings:
targethelloluketestlukeluketestluktestingendtarget
sourcehelloluketestlukeluketestluktestingendsource
I want to replace all instances of luke with something else, but only when they occur between target...endtarget, not when they occur between source...endsource. The result should be that all three instances of luke in the top string are replaced with whatever I want.
I got this far, but this only captures one instance of luke. How do I replace all of them?
(?<=target)(?:.*?(luke).*?)(?=target)
SOLUTION
Thanks to the help of this great community, I arrived at the following solution. I find RegEx really convoluted when it comes to this, but in PHP the following works great and is a lot easier to understand:
function replaceBetweenTags($starttag, $endtag, $replace, $with, $text) {
$starttag = escapeStringToRegEx($starttag);
$endtag = escapeStringToRegEx($endtag);
$text = preg_replace_callback(
'/' . $starttag . '.*?' . $endtag . '/',
function ($matches) use ($replace, $with) {
return str_replace($replace, $with, $matches[0]);
},
$text
);
return $text;
}
function escapeStringToRegEx($string)
{
$string = str_replace('\\', '\\\\', $string);
$string = str_replace('.', '\.', $string);
$string = str_replace('^', '\^', $string);
$string = str_replace('$', '\$', $string);
$string = str_replace('*', '\*', $string);
$string = str_replace('+', '\+', $string);
$string = str_replace('-', '\-', $string);
$string = str_replace('?', '\?', $string);
$string = str_replace('(', '\(', $string);
$string = str_replace(')', '\)', $string);
$string = str_replace('[', '\[', $string);
$string = str_replace(']', '\]', $string);
$string = str_replace('{', '\{', $string);
$string = str_replace('}', '\}', $string);
$string = str_replace('|', '\|', $string);
$string = str_replace(' ', '\s', $string);
$string = str_replace('/', '\/', $string);
return $string;
}
I'm aware of the fact that the escapeStringToRegEx is really quick and dirty, and maybe not even entirely correct, but it's a good starting point to work from.
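For comparison, here is a minimal sketch of the same callback idea in Python, where the standard re.escape takes the place of a hand-written escaping helper. It is an illustrative port, not the poster's code: the name replace_between_tags is made up for this example, and unlike the PHP version it also lets the span cross line breaks (re.DOTALL).

```python
import re

def replace_between_tags(start_tag, end_tag, old, new, text):
    """Replace `old` with `new`, but only inside start_tag ... end_tag spans."""
    # re.escape quotes every regex metacharacter in the tags (hypothetical helper name above).
    pattern = re.compile(re.escape(start_tag) + ".*?" + re.escape(end_tag), re.DOTALL)
    # The callback receives each matched span and rewrites it with a plain string replace.
    return pattern.sub(lambda m: m.group(0).replace(old, new), text)

sample = "luke is here and targethelloluketestlukeluketestluktestingendtarget and luke is also here"
print(replace_between_tags("target", "endtarget", "luke", "peter", sample))
```

Letting the library do the escaping avoids having to enumerate metacharacters by hand, which is where the quick-and-dirty version is most likely to go wrong.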

Here is a solution using a PHP regex callback function:
$input = "luke is here and targethelloluketestlukeluketestluktestingendtarget and luke is also here";
$output = preg_replace_callback(
"/target.*?endtarget/",
function ($matches) {
return str_replace("luke", "peter", $matches[0]);
},
$input
);
echo $output;
This prints:
luke is here and targethellopetertestpeterpetertestluktestingendtarget and luke is also here
Note that occurrences of luke have been replaced with peter only inside the target ... endtarget bounds.

You can use
(?:\G(?!\A)|target)(?:(?!luke|(?:end)?target).)*\Kluke(?=(?:(?!(?:end)?target).)*endtarget)
See the regex demo. If the string has line breaks, you need to use the s flag, or prepend the pattern with the (?s) inline modifier (PCRE_DOTALL).
Regex details:
(?:\G(?!\A)|target) - either the end of the previous successful match or target string
(?:(?!luke|(?:end)?target).)* - any single char that does not start a luke, target or endtarget char sequence, zero or more occurrences, as many as possible
\K - a match reset operator that discards the text matched so far
luke - string to replace
(?=(?:(?!(?:end)?target).)*endtarget) - a positive lookahead that matches a location that must be immediately followed with
(?:(?!(?:end)?target).)* - any single char that does not start a target or endtarget char sequence, zero or more occurrences, as many as possible
endtarget - an endtarget string.
If you can use preg_replace_callback, use it:
preg_replace_callback('/target.*?endtarget/s', function ($m) {
return str_replace("luke", "<SOME>", $m[0]);
}, $input)
Or, unrolling the loop:
preg_replace_callback('/target[^e]*(?:e(?!ndtarget)[^e]*)*endtarget/', function ($m) {
return str_replace("luke", "<SOME>", $m[0]);
}, $input)
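To see that the lazy pattern and the unrolled one select the same spans, here is a small Python sketch. It is not part of the original answer: the helper name replace_inside and the sample text are assumptions made for illustration. Note that the unrolled form needs no DOTALL flag, because [^e] already matches line breaks.

```python
import re

# Lazy dot version; DOTALL lets the dot cross line breaks.
LAZY = re.compile(r"target.*?endtarget", re.DOTALL)
# Unrolled version: runs of non-"e" chars, plus any "e" that does not begin "endtarget".
UNROLLED = re.compile(r"target[^e]*(?:e(?!ndtarget)[^e]*)*endtarget")

def replace_inside(pattern, text, old="luke", new="<SOME>"):
    """Replace `old` only inside the spans matched by `pattern` (hypothetical helper)."""
    return pattern.sub(lambda m: m.group(0).replace(old, new), text)

text = "luke targethelloluketestlukeend\nendtarget luke"
print(replace_inside(LAZY, text) == replace_inside(UNROLLED, text))
```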

Related

Matching a variable in a string in Perl from the end

I want to match a variable character in a given string, but from the end.
Ideas on how to do this action?
for example:
sub removeCharFromEnd {
my $string = shift;
my $char = shift;
if ($string =~ m/$char/) { # I want to match the char, searching from the end; $ doesn't work
print "success";
}
}
Thank you for your assistance.
There is no regex modifier that would force Perl regex engine to parse the string from right to left. Thus, the most convenient way to achieve that is via a negative lookahead:
m/$char(?!.*$char)/
The (?!.*$char) negative lookahead will require the absence (=will fail the match if found) of a $char after any 0+ chars other than linebreak chars (use s modifier if you are running the regex against a multiline string input).
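The lookahead trick is not specific to Perl. As a rough illustration, here is the same pattern in Python; the function name last_index is hypothetical, and re.escape is used so that characters such as $ are matched literally.

```python
import re

def last_index(text, char):
    """Index of the last occurrence of `char` in `text`, or -1 if it is absent."""
    c = re.escape(char)
    # Match `char` only where no further `char` follows anywhere later in the text.
    m = re.search(f"{c}(?!.*{c})", text, re.DOTALL)
    return m.start() if m else -1

print(last_index("abcabc", "b"))
```

For a literal character, str.rfind gives the same index without a regex; the lookahead form matters when the thing being searched for is itself a pattern.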
The regex engine works from left to right.
You can use the natural greediness of quantifiers to reach the end of the string and find the last char with the backtracking mechanism:
if($string =~ m/.*\K$char/s) { ...
\K marks the position of the match result beginning.
Other ways:
you can also reverse the string and use your previous pattern.
you can search all occurrences and take the last item in the list
I'm having trouble understanding what you want. Your subroutine is called removeCharFromEnd, so perhaps you want to remove $char from $string if it appears at the end of the string.
You can do that like this:
sub removeCharFromEnd {
my ( $string, $char ) = @_;
if ( $string =~ s/$char\z// ) {
print "success";
}
$string;
}
Or perhaps you want to remove the last occurrence of $char wherever it is. You can do that with
s/.*\K$char//
The subroutine I have written returns the modified string, so you would have to assign the result to a variable to save it. You can write
my $s = 'abc';
$s = removeCharFromEnd($s, 'c');
say $s;
output
ab
If you just want to modify the string in place then you should write
$_[0] =~ s/$char\z//
using whichever substitution you choose. Then you can do this
my $s = 'abc';
removeCharFromEnd($s, 'c');
say $s;
This produces the same output
To get Perl to search from the end of a string, reverse the string.
sub removeCharFromEnd {
my $string = reverse shift @_;
my $char = quotemeta reverse shift @_;
$string =~ s/$char//;
$string = reverse $string;
return $string;
}
print removeCharFromEnd(qw( abcabc b )), "\n";
print removeCharFromEnd(qw( abcdefabcdef c )), "\n";
print removeCharFromEnd(qw( !"/$%?&*!"/$%?&* $ )), "\n";
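Both ways of removing the last occurrence (greedy prefix and string reversal) can be compared side by side. The sketch below is a Python port for illustration only; the function names are made up, and Python's re has no \K, so the greedy prefix is captured and put back instead.

```python
import re

def remove_last_greedy(text, char):
    """Greedy (.*) runs to the end, then backtracks to the last `char`."""
    pattern = "(.*)" + re.escape(char)
    # Keep the captured prefix and drop the matched character.
    return re.sub(pattern, r"\1", text, count=1, flags=re.DOTALL)

def remove_last_reversed(text, char):
    """Reverse, remove the first occurrence, reverse back."""
    return text[::-1].replace(char[::-1], "", 1)[::-1]

for args in (("abcabc", "b"), ("abcdefabcdef", "c"), ('!"/$%?&*!"/$%?&*', "$")):
    print(remove_last_greedy(*args), remove_last_reversed(*args))
```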

perl regex square brackets and single quotes

Have this string:
ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722
The data is repeated.
I need to remove the []' characters from the data so it looks like this:
ABC,-0.5,10Y,10Y,TEST,ABC.1000145721ABC,-0.5,20Y,10Y,TEST,ABC.1000145722
I'm also trying to split the data to assign it to variables as seen below:
my($currency, $strike, $tenor, $tenor2,$ado_symbol) = split /,/, $_;
This works for everything but the ['TEST'] section. Should I remove the []' characters first then keep my split the same or is there an easier way to do this?
Thanks
One useful thing to know is that split takes a regex. (It will even let you capture, but the captured text is then inserted into the returned list, which is why I've used (?:...) non-capturing groups.)
I observe your data only has [' right next to the delimiter - so how about:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
while ( <DATA> ) {
chomp;
my @fields = split /(?:\'])?,(?:\[\')?/;
print Dumper \@fields;
}
__DATA__
ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722
Output:
$VAR1 = [
'ABC',
'-0.5',
'10Y',
'10Y',
'TEST',
'ABC.1000145721ABC',
'-0.5',
'20Y',
'10Y',
'TEST',
'ABC.1000145722'
];
my $str = "ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722";
$str =~ s/\['|'\]//g;
print $str;
output is
ABC,-0.5,10Y,10Y,TEST,ABC.1000145721ABC,-0.5,20Y,10Y,TEST,ABC.1000145722
Now you can split.
Clean up $ado_symbol after split:
$ado_symbol =~ s/^\['//;
$ado_symbol =~ s/'\]$//;
You can use a global regex match to find all substrings that are not a comma, a single quote, or a square bracket
Like this
use strict;
use warnings 'all';
my $s = q{ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722};
my @data = $s =~ /[^,'\[\]]+/g;
my ( $currency, $strike, $tenor, $tenor2, $ado_symbol ) = @data;
print "\$currency = $currency\n";
print "\$strike = $strike\n";
print "\$tenor = $tenor\n";
print "\$tenor2 = $tenor2\n";
print "\$ado_symbol = $ado_symbol\n";
output
$currency = ABC
$strike = -0.5
$tenor = 10Y
$tenor2 = 10Y
$ado_symbol = TEST
Another alternative
my $str = "ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722";
my ($currency, $strike, $tenor, $tenor2,$ado_symbol) = map{ s/[^A-Z0-9\.-]//g; $_} split ',',$str;
print "$currency, $strike, $tenor, $tenor2, $ado_symbol",$/;
Output is:
ABC, -0.5, 10Y, 10Y, TEST
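The two regex-based approaches above (splitting on a delimiter that swallows the brackets, and collecting everything that is not a delimiter) can be checked against each other. The following is a minimal Python sketch of both, offered only as a cross-check; the variable names are made up for this example.

```python
import re

line = "ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722"

# Split on a comma, swallowing an optional '] before it and an optional [' after it.
by_split = re.split(r"(?:'\])?,(?:\[')?", line)

# Or collect every run of characters that is not a comma, quote or square bracket.
by_findall = re.findall(r"[^,'\[\]]+", line)

currency, strike, tenor, tenor2, ado_symbol = by_split[:5]
print(currency, strike, tenor, tenor2, ado_symbol)
```

The findall variant would also strip quotes or brackets that appear in the middle of a field, so the split variant is the more conservative choice if that could happen in real data.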

How to get the expression of matching capture in Perl

In Perl, how can I get the expression of a capture that has matched in a regex?
$s = 'aaazzz';
$s =~ s/(a+)|(b+)|(c+)/.../;
$s =~ s/(?<one>a+)|(?<two>b+)|(?<three>c+)/.../;
I mean the expression (e.g. a+), not the string aaa.
I need the expression of both numbered and named captures.
I'd do something like:
use strict;
use warnings;
my @regexes = (
qr/(a+)/,
qr/(b+)/,
qr/(c+)/,
);
my $string = 'aaazzz';
foreach my $re (@regexes) {
if ($string =~ $re) {
print "Used regex is $re\n";
}
}
Output:
Used regex is (?^:(a+))
You could assemble your regex from components and then test which groups matched. For demo purposes, I have only used the match operator and not the substitution operator s///, but the same principle applies.
$s = 'aaazzz';
$part1 = '(a+)';
if ( $s =~ /$part1|(b+)|(c+)/ ) {
if ($1) {
print("$part1 matched\n");
}
else {
print("$part1 did not match\n");
}
}
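The "assemble from components" idea extends to named groups if the name-to-expression mapping is kept next to the combined regex. Here is a rough Python sketch of that approach, not a Perl solution: the names PARTS, COMBINED and matched_expression are assumptions for illustration, and it relies on the match object's lastgroup attribute, which holds the name of the last group that matched.

```python
import re

# Each alternative is kept as source text so it can be reported later.
PARTS = {"one": "a+", "two": "b+", "three": "c+"}
COMBINED = re.compile("|".join(f"(?P<{name}>{expr})" for name, expr in PARTS.items()))

def matched_expression(text):
    """Return (group name, source expression) of the alternative that matched."""
    m = COMBINED.search(text)
    if m is None:
        return None
    return m.lastgroup, PARTS[m.lastgroup]

print(matched_expression("aaazzz"))
```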

Perl idiom for quickly searching file with elements in array

what is the Perl idiom to search a string or a whole file for array elements occurrences? E.g.:
my @array = qw(word test ...);
my $string = ".......";
I want to search for word or test (can also be words, tester, etc.) inside $string and return whatever is found (i.e. group match).
I searched the docs, seems like map + grep is what I need but I just can’t come up with the code for it. Perl is such fun that I am totally clueless sometimes. :)
Using one example from map:
my @squares = map { $_ * $_ } grep { $_ > 5 } @numbers;
I suppose I can split the string into array and grep. Am I right?
grep { @array } @string; # something like grep {/(word|test)/} @string but I want to use array
my @word_roots = qw( word test );
my $pat = join '|', map quotemeta, @word_roots;
my $re = qr/\b(?:$pat)\w*\b/;
my @matches = $string =~ /($re)/g;
How about something like this from a re.pl session:
$ my @array = qw(word test)
$VAR1 = 'word';
$VAR2 = 'test';
$ my $string = ' the word is test, I said'
the word is test, I said
$ my @match_array = map { $string =~ /\b($_)\b/ } @array
$VAR1 = 'word';
$VAR2 = 'test';
The parentheses around $_ capture the match in the regex inside of map.
The \b ensures that we only match if the word is found on its own (like "test" or "word") and not inside longer words that contain the characters "test" or "word", like "sword" or "brightest". See http://www.regular-expressions.info/wordboundaries.html for more details on \b.
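The two answers differ in one respect: a trailing \w* lets a root also match its longer forms (words, tester), while \b on both sides matches the bare word only. A small Python sketch of the first variant, with a hypothetical function name find_words and re.escape playing the role of quotemeta:

```python
import re

def find_words(roots, text):
    """Find whole words in `text` that start with any of the given roots."""
    # Escape each root so it is matched literally, then join into one alternation.
    alternation = "|".join(re.escape(root) for root in roots)
    # \b anchors the root at a word start; \w* takes the rest of the word, if any.
    pattern = re.compile(rf"\b(?:{alternation})\w*\b")
    return pattern.findall(text)

sentence = "the word is test, I said words to the tester, not sword or brightest"
print(find_words(["word", "test"], sentence))
```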

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will match 'a' at least three times greedily. It may match five times, 10 times, etc. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?
Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}
Use @LAST_MATCH_END and @LAST_MATCH_START (the @+ and @- arrays):
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.
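The same end-minus-start arithmetic can be sketched in Python, where a match object exposes the offsets directly. The function name run_length is made up for this illustration, and the same caveat applies: the span length equals the repetition count only when the repeated pattern is exactly one character wide.

```python
import re

def run_length(text):
    """Length of the first run of three or more 'a' characters, or None."""
    m = re.search(r"a{3,}", text)
    if m is None:
        return None
    # For a one-character pattern, span length equals the number of repetitions.
    return m.end() - m.start()

print(run_length("jlkjmkaaaaaamlmk"))
```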
Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.
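For readers who want to try the capture-then-recount technique outside Perl, here is a minimal Python sketch of it. The names PATT and count_repetitions are assumptions for this example, and the lookaround limitation noted above applies here as well.

```python
import re

PATT = r"b?.at"  # the repeated subpattern, kept separate so it can be reused

def count_repetitions(text):
    """Count how many times PATT repeated inside the {3,} group."""
    # Step 1: capture the whole repeated run.
    m = re.search(rf"some ((?:{PATT}){{3,}}) thing", text)
    if m is None:
        return 0
    # Step 2: match the bare subpattern against the captured run and count the hits.
    return len(re.findall(PATT, m.group(1)))

print(count_repetitions("some catbratmatrattatblat thing"))
```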
If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on v5.14 & v5.16).
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.