Why does perl not keep the match variable around in this situation? - regex

I just struggled a long time to come up with a working little perl one-liner like this:
perl -pe 'if (/^(".*",").*, /) { $a = $1; s/, /"\n$a/g}'
My input data looks something like this:
"foo","bar a"
"baz","bar a, bar b, bar c"
And I'm transforming it to this:
"foo","bar a"
"baz","bar a"
"baz","bar b"
"baz","bar c"
Basically I wanted to match only certain lines (if (/, /)...) and, on those lines, replace every instance of that match with a part of the original line. A single s///g with a match group would not work because it would not recurse properly; the replacement string has to be figured out before the replacements start happening. So I expected this to work:
if (/^(".*",").*, /) { s/, /"\n$1/g}
Yet it did not. The var $1 was never anything but empty. Given what the perl docs I read said about persistence, this was a surprise to me:
These match variables generally stay around until the next successful pattern match.
Only when I started stashing the result in a variable of my own could I access the result from the substitution expression:
if (/^(".*",").*, /) { $a = $1; s/, /"\n$a/g}
Why was $1 being cleared when not only was there no successful match, there was no request for a match at all in my search and replace? And would there have been a better way to approach this problem?

The values of match variables do indeed stay around until the next successful pattern match (or until the scope in which the match occurred is exited).
In your case, they changed because there was a successful pattern match: your substitution successfully matched against the pattern /, /. The capture variables therefore reflect the text captured by that match. That pattern has no capture groups, so $1 (the text matched by the non-existent first capture) is undef.
$ perl -e'
$_ = "a";
s/(a)/a/; CORE::say $1 // "[undef]"; # Successful match
s/(c)/c/; CORE::say $1 // "[undef]"; # Unsuccessful match
s/a/a/; CORE::say $1 // "[undef]"; # Successful match
'
a
a
[undef]

You ask:
Why was $1 being cleared when not only was there no successful match, there was no request for a match at all in my search and replace?
Are you perhaps conflating matching and capturing?
For s/PATTERN/REPLACEMENT/ to do anything, the PATTERN must match. So if there is any substitution at all as a result of an s/// operation, you know that its PATTERN regex-matched successfully. REPLACEMENT is then evaluated.
(In your case, the s/, /.../ PATTERN matches at least once on the comma and space after the text bar a in your second input line.)
Of course, when that happens, the interpreter will reset all the capture elements ($1, $2, etc.) to whatever PATTERN captured. Again, this is before REPLACEMENT is evaluated. Since your PATTERN doesn't capture anything, those elements are undefined, just as they would be if you had explicitly done a non-capturing m/, / match.
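As for whether there is a better approach: one option is to copy the captures into lexical variables in the same statement that performs the match, so that no later regex can reset them. The following is only a sketch, not code from the question or the answers. The sub name expand_line is made up for illustration, and it assumes the first field contains no embedded double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): turn
#   "baz","bar a, bar b"   into one line per list item.
sub expand_line {
    my ($line) = @_;

    # A list-context match returns the captures, so they are copied into
    # lexicals before any later regex can reset $1 and $2.
    if (my ($prefix, $list) = $line =~ /^("[^"]*",")(.*, .*)"$/) {
        return join "\n", map($prefix . $_ . '"', split(/, /, $list));
    }
    return $line;    # lines without ", " pass through unchanged
}

print expand_line($_), "\n" for '"foo","bar a"', '"baz","bar a, bar b, bar c"';
```

Saving $1 in your own variable, as you ended up doing, is the same idea; this version just does the saving as part of the match itself.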

Related

How to reconstruct regex matched part

I have to simplify some LaTeX math formulas within text, for example
This is ${\text{BaFe}}_{2}{\text{As}}_{2}$ crystal
I want to transform this into
This is BaFe2As2 crystal
That is, to concatenate only the content within the innermost brackets.
I figured out that I can use the regex pattern
\{[^\{\}]*\}
to match those innermost brackets. But the problem is how to concatenate them together?
I don't know if this can be done with Notepad++ regex replacement. If Notepad++ is not capable of it, I can also accept a Perl one-liner solution.
There may well be several such equations (the markup between two $s) in the document. So while you need to assemble the text between all {}, this also needs to be constrained to within a $ pair. Then all such equations need to be processed.
Matching that in a single pattern results in a complex regex. Instead, we can first extract everything within a pair of $s and then gather the text within {}s from that, which simplifies the regex a lot. This makes two passes over each equation, but a LaTeX document is small for computational purposes and the loss of efficiency won't be noticeable.
use warnings;
use strict;
use feature 'say';
my $text = q(This is ${\text{BaFe}}_{2}{\text{As}}_{2}$ crystal,)
. q( and ${\text{Some}}{\mathbf{More}}$ text);
my @results;
while ($text =~ /\$(.*?)\$/g) {
my $eq = $1;
push @results, join('', $eq =~ /\{([^{}]+)\}/g);
}
say for @results;
This prints lines BaFe2As2 and SomeMore.
The regex in the while condition captures all chars between two $s. After the body of the loop executes and the condition is checked again, the regex continues searching the string from the position of the previous match. This is due to the "global" modifier /g in scalar context, imposed on regex since it is in the loop condition. Once there are no more matches the loop terminates.
In the body we match between {}, and again due to /g this is done for all {}s in the equation. Here, however, the regex is in the list context (as it is assigned to an array) and then /g makes it return all matches. They are joined into a string, which is added to the array.
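The difference between /g in scalar and in list context can be seen in isolation with a small sketch. The input string here is a made-up stand-in for one extracted equation, not data from the question.

```perl
use strict;
use warnings;

my $eq = '{a}{b}{c}';    # made-up stand-in for one extracted equation

# List context: a single evaluation returns every capture at once.
my @all = $eq =~ /\{([^{}]+)\}/g;

# Scalar context: each evaluation finds the next match and remembers
# where it stopped; pos($eq) reports that offset.
my @stops;
while ($eq =~ /\{([^{}]+)\}/g) {
    push @stops, pos($eq);
}

print "all:   @all\n";      # all:   a b c
print "stops: @stops\n";    # stops: 3 6 9
```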
In order to replace the processed equation, use this in a substitution instead
$text =~ s{ \$(.*?)\$ }{ join('', $1 =~ /\{([^{}]+)\}/g) }egx;
where the modifier e makes it so that the replacement part is evaluated as Perl code, and the result of that used to replace the matched part. Then in it we can run our regex to match content of all {} and join it into the string, as explained above. I use s{}{} delimiters, and x modifier so to be able to space things in the matching part as well.
Since the whole substitution has the g modifier the regex keeps going through $text, as long as there are equations to match, replacing them with what's evaluated in the replacement part.
I use a hard-coded string (extended) from the question, for an easy demo. In reality you'd read a file into a scalar variable ("slurp" it) and process that.
This relies on the question's premise that text of interest in an equation is cleanly between {}.
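For a quick check of the substitution form, here is a self-contained sketch that applies it to a copy of the demo string. The variable name $plain is just for illustration.

```perl
use strict;
use warnings;

my $text = q(This is ${\text{BaFe}}_{2}{\text{As}}_{2}$ crystal,)
         . q( and ${\text{Some}}{\mathbf{More}}$ text);

# Outer s///: finds each $...$ equation; /e evaluates the replacement as code.
# Inner match: list context plus /g returns the content of every innermost {}.
(my $plain = $text) =~ s{ \$ (.*?) \$ }{ join '', $1 =~ /\{([^{}]+)\}/g }egx;

print "$plain\n";    # This is BaFe2As2 crystal, and SomeMore text
```

Working on a copy keeps the original $text around, which is handy while experimenting with the pattern.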
Missed the part that a one-liner is sought
perl -0777 -wnE'say join("", $1=~/\{([^{}]+)\}/g) while /\$(.*?)\$/g' file.tex
With -0777 the file is read whole ("slurped"), and since -n provides a loop over the input, the file's content is in the $_ variable; the regex in the while condition works on $_ by default. In each iteration of while, the contents of the captured equation, in $1, are matched directly for {}s.
Then to replace each equation and print out the whole processed file
perl -0777 -wne's{\$(.*?)\$}{join "", $1=~/\{([^{}]+)\}/g}eg; print' file.tex
where I've removed extra spaces and (unnecessary) parens on join.
Use this regex in Notepad++. I have tried to match everything which is NOT present between the innermost curly brackets and then replaced the match with a blank string.
[^{}]*\{|\}[^{}]*
Explanation:
[^{}]*\{ - matches 0+ occurrences of any character that is neither { nor } followed by {
| - OR
\}[^{}]* - matches } followed by 0+ occurrences of any character that is neither { nor }
UPDATE:
Try this updated regex:
\$?(?=[^$]*\$[^$]*$)(?:[^{}]*{|}[^{}]*)(?=[^$]*\$[^$]*$)\$?

Regex to find(/replace) multiple instances of character in string

I have a (probably very basic) question about how to construct a (perl) regex, perl -pe 's///g;', that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/
For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by #, possibly with some characters in between.
[abc#def"ghi"jkl'123]
The following regex
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;
will find the first " and if I run it three times I have all three.
Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g;
does the trick for one instance. But how can I find all of them in one go?
The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.
In [abc#def"ghi"jkl'123], there is only a single match (which is the [abc#def" part of the string, with $1 = '[abc#def' and $2 = ''), so only the first " is removed.
After the first match, Perl scans the remaining string (ghi"jkl'123]) for another match, but it doesn't find another [ (or #).
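This can be checked directly: s///g returns the number of substitutions it made, and for the sample string that number is 1. A small sketch (the variable names are mine, the regex is the one from the question):

```perl
use strict;
use warnings;

my $s = q([abc#def"ghi"jkl'123]);

# The question's regex, applied once with /g to a copy of the string.
my $once  = $s;
my $count = ($once =~ s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g);

print "substitutions: $count\n";    # substitutions: 1
print "result: $once\n";            # result: [abc#defghi"jkl'123]
```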
I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.
In code:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;
Or to replace each match by X:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;
We match a prefix of [, followed by 0 or more characters that are not [ or ] or #, followed by #.
\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).
We match and capture 0 or more characters that are not [ or ].
Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).
The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr, respectively, but since each inner match is just a single character, it's also possible to use a transliteration.
We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
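To see both variants in action, here is a sketch that runs them on copies of a sample string (the surrounding "ah ... oh" text and the variable names are made up for the demo). It assumes Perl 5.14 or later, because tr///r is not available before that.

```perl
use strict;
use warnings;

my $s = q(ah [abc#def"ghi"jkl'123] oh);

# Delete every non-alphanumeric character after the # inside [...].
(my $deleted = $s) =~ s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;

# Replace each of them with X instead.
(my $marked = $s) =~ s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;

print "$deleted\n";    # ah [abc#defghijkl123] oh
print "$marked\n";     # ah [abc#defXghiXjklX123] oh
```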
So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):
The \K causes everything before it to be dropped....so only the character after it is actually replaced.
perl -pe 'while (s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//){}' file
Basically we just have an empty loop that executes until the search and replace replaces nothing.
Slightly improved version:
perl -pe 'while (s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file
The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the *s lazy with the ? so we get the shortest match possible.
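To make the cost of the loop visible, this sketch counts how many substitutions succeed on the sample string before the loop stops. The counter $passes is only there for the demo; each successful pass rescans from the start of the string and removes one character.

```perl
use strict;
use warnings;

my $s = q([abc#def"ghi"jkl'123]);

# Same regex as the improved one-liner; loop until nothing is replaced.
my $passes = 0;
$passes++ while $s =~ s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//;

print "$s after $passes successful passes\n";
# [abc#defghijkl123] after 3 successful passes
```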
Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that cleans it of non-alphanumeric characters
use warnings;
use strict;
use feature 'say';
my $var = q(ah [abc#def"ghi"jkl'123] oh); #'
say $var;
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
(my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
$v
}ex;
say $var;
where the lone $v is needed so that the block returns the cleaned string and not the number of substitutions, which is what the s/// operator itself returns. This can be improved by using the /r modifier, which returns the changed string and doesn't change the original (so it doesn't attempt to change $1, which isn't allowed)
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
$1 =~ s/[^0-9a-zA-Z]//gr;
}ex;
The \K is there so that everything matched before it is "dropped" from the match: it stays in the string untouched, so we don't need to capture it in order to put it back. The /e modifier makes the replacement part be evaluated as code.
The code in the question doesn't work because everything matched is consumed, and (under /g) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails and only that first occurrence is replaced.
The problem of matched text that we want to leave in the string can often be remedied by \K (used in all current answers), which excludes everything matched before it from the text that gets replaced.

perl regex substring

$str="!bypass";
I need to return the string only if it starts with "!".
How can I return bypass?
To match strings that start with a ! you need this pattern. The ^ is the anchor at the beginning of the string.
/^!/
If you want to capture the stuff after the !, you need this pattern. The parentheses () are a capture group. They tell Perl to grab everything between them and keep it. The . means any character, and the + is a quantifier for as many as possible, at least one. So .+ means grab everything.
/^!(.+)/
To apply it, do this.
$str =~ m/^!(.+)/;
And to get the "bypass" out of that pattern, use the $1 match variable that was assigned automatically by Perl with the m// operation.
print $1; # will print bypass
To make that conditional, it would be:
print $1 if $str =~ m/^!(.+)/;
The if here is in post-fix notation, which lets you omit the block and the parentheses. It's the same as the following, but shorter and easier to read for single statements.
if ( $str =~ m/^!(.+)/ ) {
print $1;
}
If you want to permanently change $str to not have an exclamation mark at the beginning, you need to use a substitution instead.
$str =~ s/^!//;
The s/// is the substitution operator. It changes $str in place. The original value including the ! will be lost.
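Putting the conditional form into a small helper shows why the check matters: $1 is only meaningful when the match succeeded. This is a sketch; the sub name after_bang and the second test string are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text after a leading "!", or undef
# when the string does not start with "!".
sub after_bang {
    my ($str) = @_;
    return $str =~ /^!(.+)/ ? $1 : undef;
}

for my $str ('!bypass', 'plain') {
    my $rest = after_bang($str);
    print "$str -> ", (defined $rest ? $rest : '(no leading !)'), "\n";
}
# !bypass -> bypass
# plain -> (no leading !)
```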
Use ^!\K.+.
It works this way:
^! - Match initial ! (but this will soon change, see below).
\K - Keep - "forget" about what you have matched so far and set the starting point of the match here (after the !).
.+ - Match non-empty sequence of chars.
Due to \K, only the last part (.+) is actually matched.
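Since this pattern has no capture group, the text is read from $& (the whole match) rather than $1. A minimal sketch, assuming Perl 5.10 or later for \K; the variable name $kept is mine.

```perl
use strict;
use warnings;

my $str = '!bypass';

my $kept;
if ($str =~ /^!\K.+/) {
    $kept = $&;    # whole match; it starts after the "!" because of \K
}

print "$kept\n";    # bypass
```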

Why is Perl lazy when regex matching with * against a group?

In perl, the * is usually greedy, unless you add a ? after it. When * is used against a group, however, the situation seems different. My question is "why". Consider this example:
my $text = 'f fjfj ff';
my (@matches) = $text =~ m/((?:fj)*)/;
print "@matches\n";
# --> ""
@matches = $text =~ m/((?:fj)+)/;
print "@matches\n";
# --> "fjfj"
In the first match, perl lazily prints out nothing, though it could have matched something, as is demonstrated in the second match. Oddly, the behavior of * is greedy as expected when the contents of the group is just . instead of actual characters:
@matches = $text =~ m/((?:..)*)/;
print "@matches\n";
# --> 'f fjfj f'
Note: The above was tested on perl 5.12.
Note: It doesn't matter whether I use capturing or non-capturing parentheses for inside group.
This isn't a matter of greedy or lazy repetition. (?:fj)* is greedily matching as many repetitions of "fj" as it can, but it will successfully match zero repetitions. When you try to match it against the string "f fjfj ff", it will first attempt to match at position zero (before the first "f"). The maximum number of times you can successfully match "fj" at position zero is zero, so the pattern successfully matches the empty string. Since the pattern successfully matched at position zero, we're done, and the engine has no reason to try a match at a later position.
The moral of the story is: don't write a pattern that can match nothing, unless you want it to match nothing.
Perl will match as early as possible in the string (left-most). It can do that with your first match by matching zero occurrences of fj at the start of the string.
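The position of each match can be inspected through the @- array, whose first element holds the offset where the last successful match started. A short sketch using the string from the question (the variable names are mine):

```perl
use strict;
use warnings;

my $text = 'f fjfj ff';

$text =~ /((?:fj)*)/ or die "no match";
my ($star_start, $star_len) = ($-[0], length($1));

$text =~ /((?:fj)+)/ or die "no match";
my ($plus_start, $plus_text) = ($-[0], $1);

print "star: offset ${star_start}, length ${star_len}\n";   # star: offset 0, length 0
print "plus: offset ${plus_start}, text ${plus_text}\n";    # plus: offset 2, text fjfj
```

The * version succeeds at offset 0 with an empty match, so the engine never gets as far as offset 2, which is where the + version has to go to find at least one fj.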

Perl regex with grouping in LHS

How can I match the following lines?
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
I want to remove the - repetitive.text part from the end, but only if it repeats.
sometext_TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
My attempt:
use strictures;
my $text="sometext_TEXT1.xxx-TEXT1.xxx";
$text =~ s/(.*?)(.*)(\s*-\s*$2)/$1$2/;
print "$text\n";
prints
Use of uninitialized value $2 in regexp compilation at a line 3.
In other words, I'm looking for a better solution than the following split + match...
while(<DATA>) {
chomp;
my($first, $second) = split /\s*-\s*/;
s/\s*-\s*$second$// if ( $first =~ /$second$/ );
print "$_\n";
}
__DATA__
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
$text =~ s/(.*?)(.*)(\s*-\s*$2)/$1$2/;
This regex has various issues, but is on the right path.
Use \2 (or better: \g2 or \g{-1}) or something to reference the contents of a capture group. The $2 variable is interpolated when the Perl statement is executed. At that time, $2 is undefined, as there was no previous match. You get a warning as it is uninitialized. Even if it were defined, the pattern would be fixed during compilation.
You define three capture groups, but only need one. There is a trick with the \K ("keep") directive: it lets the regex engine forget the previously matched text, so that it won't be affected by the substitution. That is, s/(foo)b/$1/ is equivalent to s/foo\Kb//. The effect is similar to a variable-length lookbehind.
The (.*?)(.*) part is a bit of a backtracking nightmare. We can reduce the cost of your match by adding further conditions, e.g. by anchoring the pattern at start and end of line. Using the above modifications, we now have s/^.*?(.*)\K\s*-\s*\g1$//. But on second thought, we can just remove the ^.*? because this describes something the regex engine does anyway!
A short test:
while(<DATA>) {
s/(.*)\K\s*-\s*\g1$//;
print;
}
__DATA__
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
Output:
sometext_TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
A few words regarding your splitting solution: This will also shorten the line
sometext_TEXT1xyyy - TEXT1.yyy
because when you interpolate a variable into a regex, the contents aren't matched literally. Instead, they are interpreted as a pattern (where . matches any non-newline codepoint)! You can avoid this by quoting all metacharacters with the \Q...\E escape:
s/\s*-\s*\Q$second\E$// if $first =~ /\Q$second\E$/;
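The effect of \Q...\E can be seen by testing the same pair of strings with and without it. This is a sketch with a made-up input line in which the part before the hyphen has an x where the part after it has a dot.

```perl
use strict;
use warnings;

my $line = 'sometext_TEXT1xyyy - TEXT1.yyy';    # made-up input for the demo
my ($first, $second) = split /\s*-\s*/, $line;

# Unquoted: the "." in $second is a metacharacter and matches the "x".
my $unquoted = ($first =~ /$second$/)     ? 'match' : 'no match';

# Quoted: \Q...\E escapes the ".", so only a literal dot would match.
my $quoted   = ($first =~ /\Q$second\E$/) ? 'match' : 'no match';

print "unquoted: $unquoted\n";    # unquoted: match
print "quoted:   $quoted\n";      # quoted:   no match
```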
When you use $2 Perl will try to interpolate that variable, but the variable will only be set after the match has completed. What you want, is a backreference, for which you need to use \2:
$text =~ s/(.*?)(.*)(\s*-\s*\2)/$1$2/;
Note that, when the replacement part is evaluated, $1 and $2 have been set and can be interpolated as expected. Also you could make the pattern a bit more concise (and probably more efficient), by using:
$text =~ s/(.*)\s*-\s*\1/$1/;
There is no need to match the initial part (.*?) if it's arbitrary and you just write it back anyway. What you might want to do though, is anchor the pattern to the end of the string:
$text =~ s/(.*)\s*-\s*\1$/$1/;
Otherwise (with your initial attempt or mine), you'd turn something-thingelse into somethingelse.
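The difference the end anchor makes can be checked with a short sketch. The two sub names are made up for the comparison; the patterns are the single-group ones discussed above, with \1 as the backreference.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two single-group patterns.
sub strip_unanchored { my ($s) = @_; $s =~ s/(.*)\s*-\s*\1/$1/;  return $s }
sub strip_anchored   { my ($s) = @_; $s =~ s/(.*)\s*-\s*\1$/$1/; return $s }

for my $in ('sometext_TEXT1.yyy-TEXT1.yyy',
            'anothertext_OTHER.yyy-MAX.yyy',
            'something-thingelse') {
    print "$in\n";
    print "  unanchored: ", strip_unanchored($in), "\n";
    print "  anchored:   ", strip_anchored($in), "\n";
}
# sometext_TEXT1.yyy-TEXT1.yyy
#   unanchored: sometext_TEXT1.yyy
#   anchored:   sometext_TEXT1.yyy
# anothertext_OTHER.yyy-MAX.yyy
#   unanchored: anothertext_OTHER.yyyMAX.yyy
#   anchored:   anothertext_OTHER.yyy-MAX.yyy
# something-thingelse
#   unanchored: somethingelse
#   anchored:   something-thingelse
```

Note the second input: because (.*) can match the empty string, the unanchored version also removes a lone hyphen, which is one more reason to anchor the pattern.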