substitution regex, with capture - regex

Maybe this is a stupid question, but:
I run Perl 5.8.8 and I need to replace any underscore preceded by a digit with "0".
Running: $var =~s /(\d)_/$10/g;
obviously does not work, as $10 is interpreted as... well... $10, not "$1 followed by 0".
Moreover, as I am running Perl 5.8, I can't do
$var=~s/(?<n1>\d)\_/$+{n1}0/g;
Any idea?
Thanks in advance

Just like in various Unix shells, you can enclose the variable name in braces for disambiguation.
$var =~s /(\d)_/${1}0/g;
Or you can use a look-behind to prevent the digit from being part of the match:
$var =~s /(?<=\d)_/0/g;
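To see that both forms give the same result, here is a minimal sketch. The sample string and the variable names are made up for illustration; nothing here needs anything newer than Perl 5.8.

```perl
use strict;
use warnings;

# Hypothetical sample input: two underscores follow a digit, one does not.
my $input = "a1_b22_c_d";

# Braces end the variable name after "1", so the literal 0 is appended to $1.
(my $with_braces = $input) =~ s/(\d)_/${1}0/g;

# The look-behind requires a digit before the underscore
# but keeps the digit out of the matched text.
(my $with_lookbehind = $input) =~ s/(?<=\d)_/0/g;

print "$with_braces\n";
print "$with_lookbehind\n";
```

The underscore after "c" is left alone in both cases because it is not preceded by a digit.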

This would also be a good place for a zero width look-behind assertion:
$var =~ s/(?<=\d)_/0/g;
It looks for a digit without actually slurping the digit into the matched text.

$var =~s/(\d)_/${1}0/g;

Other possibilities (\K needs perl 5.10 or later, so only the look-behind form applies to perl 5.8.8):
s/\d\K_/0/g
s/(?<=\d)_/0/g

How to match repetition of a word in regular expression of perl

The situation is very simple. The word "gat" may appear 0 or 1 time in a string. How can I write regex to match it?
Right now I can only use the following to do what I want. It works in my situation, though it would also match "ga", "at" etc.
$str =~ m/(g?a?t?)/
I guess there is a much easier expression to use "?" on the word "gat", but I tried "{}" and it doesn't work.
Thanks!
Use a Non-capturing Group and the ? quantifier
$str =~ m/...(?:gat)?.../
Can also be written as:
$str =~ m/...(?:gat){0,1}.../
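A small sketch of the difference between making the whole word optional and making each letter optional. The surrounding "a ... b" context and the sample strings are assumptions chosen only to make the difference visible.

```perl
use strict;
use warnings;

# Hypothetical context: "a", then an optional "gat", then "b".
my $optional_word    = qr/^a(?:gat)?b$/;
# The per-letter version from the question, placed in the same context.
my $optional_letters = qr/^ag?a?t?b$/;

for my $s (qw(ab agatb agab atb)) {
    my $word    = $s =~ $optional_word    ? "yes" : "no";
    my $letters = $s =~ $optional_letters ? "yes" : "no";
    printf "%-6s word: %-3s letters: %s\n", $s, $word, $letters;
}
```

Only the grouped form rejects partial words such as "ga" or "t" between the "a" and the "b".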
.*?(\b(?:gat)\b)?
Try this. It will find every gat.
http://regex101.com/r/pP3pN1/33

Lookbehind does not work as expected

I am trying to understand the lookbehind.
This example I am trying doesn't work as I expected. I wanted to form a regex that would match "John" but not "John." (with a trailing period).
The following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*/) {
print "Matches!\n";
}
'
Matches!
matches up to and including the "." of course. The problem is the following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*(?<![.])/) {
print "Matches!\n";
}
'
Matches!
For the latter I expected that the regex would match "John.", consuming the period.
Then at the next position it would look behind, realize that it had consumed a period (.), and reject the match.
Is my understanding wrong? What am I messing up here?
Update:
Same result also for my $var = "John. ";
Update 2:
My question is not about how to match only John and not John.
But to understand how lookbehind works and if it is not supposed to work in this case why.
The * is a quantification operator, not a placeholder. So A* means zero or more A characters. Without any further context, this always matches, e.g. "foo" =~ /J*/ is true.
What you intended to write was /J.*/ which does what you've actually described.
Now let's look at what happens when we do "John." =~ /(J.*(?<![.]))/:
The regex engine sees J, which matches.
The next pattern is .*, which matches "ohn." (including the period).
Next the assertion (?<![.]) is tested, which fails.
The regex engine therefore backtracks.
We try .* again, but this time it only matches "ohn".
Next the assertion (?<![.]) is tested, which succeeds.
In the above regex, I enclosed the pattern in a capture group, which we can now read out:
$ perl -E'"John." =~ /(J.*(?<![.]))/ and say "<$1>" or say "No match"'
<John>
It is often more efficient to use a character class instead of assertions and .* quantifications, so that we can avoid backtracking:
/J[^.]*/
However, this is not strictly equivalent to the above regexes.
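One way to see that the two patterns are not equivalent is to try an input with a period in the middle. The input string below is a made-up example, not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical input with a period that is not the last character.
my $text = "John. Smith";

# .* runs to the end of the string; the look-behind only checks
# that the character just before the end position is not a period.
my ($with_assertion) = $text =~ /(J.*(?<![.]))/;

# The negated class stops at the first period it meets.
my ($with_class) = $text =~ /(J[^.]*)/;

print "<$with_assertion>\n";
print "<$with_class>\n";
```

The assertion version keeps everything up to the last position that is not preceded by a period, while the character class version stops at the first period.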
This regexp:
/John(?![.])/
will match "John" but not "John." (with a trailing period). It uses a negative look-ahead assertion (rather than look-behind).
If you want to match full names other than 'John', you'll need to be a bit more specific about what you do and don't want to allow in the match, as putting J* will match zero or more J's.
Edit: Obviously I misread the * per @amon's post. Look-ahead vs. look-behind still applies.
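A quick sketch of why the direction of the assertion matters here. Placed right after "John", a negative look-behind inspects the "n" that was just matched, while a negative look-ahead inspects the character that follows. The sample strings and the %seen hash are only for illustration.

```perl
use strict;
use warnings;

my %seen;
for my $s ("John", "John.", "Johnny") {
    $seen{$s} = {
        # Look-ahead: the character after "John" must not be a period.
        lookahead  => ($s =~ /John(?![.])/  ? 1 : 0),
        # Look-behind: the character before this position is "n", never a period.
        lookbehind => ($s =~ /John(?<![.])/ ? 1 : 0),
    };
    print "$s: lookahead=$seen{$s}{lookahead} lookbehind=$seen{$s}{lookbehind}\n";
}
```

The look-behind version matches all three strings, so it cannot tell "John" and "John." apart.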

Perl: remove a part of string after pattern

I have strings like this:
trn_425374_1_94_-
trn_12_1_200_+
trn_2003_2_198_+
And I want to remove everything after the first number, like this:
trn_425374
trn_12
trn_2003
I tried the following code:
$string =~ s/(?<=trn_\d)\d+//gi;
But it returns the same string as the input. I have been following examples from similar questions, but I don't know what I'm doing wrong. Any suggestions?
If you are running Perl 5 version 10 or later then you have access to the \K ("keep") regular expression escape. Everything before the \K is excluded from the substitution, so this removes everything after the first sequence of digits (except newlines)
s/\d+\K.+//;
With earlier versions of Perl, you will have to capture the part of the string you want to keep, and put it back in the substitution
s/(\D*\d+).+/$1/;
Note that neither of these will remove any trailing newline characters. If you want to strip those as well, then either chomp the string first, or add the /s modifier to the substitution, like this
s/\d+\K.+//s;
or
s/(\D*\d+).+/$1/s;
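As a rough check, both variants can be run over the sample strings from the question. The array and variable names are assumptions for this sketch, and the \K line needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my @ids = ("trn_425374_1_94_-", "trn_12_1_200_+", "trn_2003_2_198_+");

my (@with_keep, @with_capture);
for my $id (@ids) {
    # \K (Perl 5.10+): the digits are matched but kept out of the replaced text.
    (my $kept = $id) =~ s/\d+\K.+//;
    push @with_keep, $kept;

    # Capture version for older Perls: put the kept prefix back with $1.
    (my $captured = $id) =~ s/(\D*\d+).+/$1/;
    push @with_capture, $captured;
}

print "$_\n" for @with_keep;
print "$_\n" for @with_capture;
```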
Use a capture group to save the first run of digits found, and use .* to delete from there until the end of the line:
#!/usr/bin/env perl
use warnings;
use strict;
while ( <DATA> ) {
s/(\d+).*$/$1/ && print;
}
__DATA__
trn_425374_1_94_-
trn_12_1_200_+
trn_2003_2_198_+
It yields:
trn_425374
trn_12
trn_2003
Your regex should be:
$string =~ s/(trn_\d+).*/$1/g;
It replaces the whole match with the text captured in $1 (the part of the string you want to preserve).
Use \K to preserve the part of the string you want to keep:
$string =~ s/trn_\d+\K.*//;
To quote the link above:
\K
This appeared in perl 5.10.0. Anything matched left of \K is not
included in $& , and will not be replaced if the pattern is used in a
substitution.

Is this line of Perl meaningless? s/^(\d+)\b/$1/sg

Does this line of Perl really do anything?
$variable =~ s/^(\d+)\b/$1/sg;
The only thing I can think of is that $1 or $& might be re-used, but the line is immediately followed by:
$variable =~ s/\D//sg;
With these two lines together, is the first line meaningless and removable? It seems like it would be, but I have seen it multiple times in this old program, and wanted to make sure.
$variable =~ s/^(\d+)\b/$1/sg;
The anchor ^ at the beginning makes the /g modifier useless.
The lack of the wildcard character . in the pattern makes the /s modifier useless, since it only serves to make . also match newline.
Since \b and ^ are zero-width assertions, and the only things outside the capture group, this substitution will not change the variable at all.
The only thing this regex does is capture the digits into $1, if they are found.
The subsequent regex
$variable =~ s/\D//sg;
will remove all non-digits, making the variable just one long number. If one wanted to separate the first part (matched by the first regex), the only way to do so would be by accessing $1 from the first regex.
However, the first regex in that case would be better written simply:
$variable =~ /^(\d+)\b/;
And if the capture is supposed to be used:
my ($num) = $variable =~ /^(\d+)\b/;
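The claim that the first substitution leaves the variable untouched, and only sets $1, can be checked with a short sketch. The input string is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input: leading digits followed by other text.
my $variable = "123 Main St, Apt 4";
my $before   = $variable;

# The line from the question: replaces the leading digits with themselves.
$variable =~ s/^(\d+)\b/$1/sg;
my $captured  = $1;                              # side effect: $1 holds the digits
my $unchanged = ($variable eq $before) ? 1 : 0;

# The follow-up line strips everything that is not a digit.
$variable =~ s/\D//sg;

print "unchanged after first line: $unchanged\n";
print "captured: $captured\n";
print "final: $variable\n";
```

If nothing reads $1 between the two lines, dropping the first substitution gives the same final value.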
Is "taint mode" in use? (Script is invoked with -T option.)
Maybe it's used to sanitize (i.e. untaint) user input.

Why does perl ignore extra characters in my regex?

I have this line in bash:
echo "a=-1"|perl -nle 'if (/.*=[0-9]*/){print;}'
and get:
a=-1
Wait... I didn't say perl should match on the -. I made a minor change to:
echo "a=-1"|perl -nle 'if (/.*=[0-9]*$/){print;}'
and it correctly ignores the line. Why?
You might find it useful, while developing a regex, to print only the part of the string that the regex actually matched, instead of the entire line. This will give you better insight into what your regex is doing. You can do this with the special $& variable. So instead of:
echo "a=-1"|perl -nle 'if (/.*=[0-9]*/){print;}'
use
echo "a=-1"|perl -nle 'if (/.*=[0-9]*/){print $&;}'
You will now get different output:
a=
And this new information may give you a head start in understanding how your regex is [mis]behaving with regards to the input data.
[0-9]* can match the empty string. When you anchored to the end of the string, you prevented this empty match.
You probably want to say [0-9]+ to mean "at least one digit".
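A minimal sketch of the three cases, using the "a=-1" input from the question plus one made-up input ("a=1") that does have a digit right after the equals sign. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $line = "a=-1";

# [0-9]* is allowed to match nothing, so the match stops right after "=".
my ($star) = $line =~ /(.*=[0-9]*)/;

# Anchored with $: the empty digit run would have to be followed by the end of the string.
my $star_anchored = $line =~ /.*=[0-9]*$/ ? 1 : 0;

# [0-9]+ demands at least one digit directly after "=".
my $plus = $line =~ /.*=[0-9]+/ ? 1 : 0;
my ($plus_ok) = "a=1" =~ /(.*=[0-9]+)/;

print "star matched: <$star>\n";
print "star with \$ matched: $star_anchored\n";
print "plus matched: $plus\n";
print "plus on a=1 matched: <$plus_ok>\n";
```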
In the first example, you say that [0-9] can occur "*" times, which means zero or more, so the regex matches only "a=". When you added the "$", it no longer matches, because the string doesn't end right after the [0-9]*.