perl regex substring - regex

$str="!bypass";
I need return string that only start with regex "!"
How can I return bypass ?

To match strings that start with a ! you need this pattern. The ^ is the anchor at the beginning of the string.
/^!/
If you want to capture the stuff after the !, you need this pattern. The parenthesis () are a capture group. They tell Perl to grab everything between them and keep it. The . means any character, and the + is a quantifier for as many as possible, at least one. So .+ means grab everything.
/^!(.+)/
To apply it, do this.
$str =~ m/^!(.+)/;
And to get the "bypass" out of that pattern, use the $1 match variable that was assigned automatically by Perl with the m// operation.
print $1; # will print bypass
To make that conditional, it would be:
print $1 if $str =~ m/^!(.+)/;
The if here is in post-fix notation, which lets you omit the block and the parenthesis. It's the same as the following, but shorter and easier to read for single statements.
if ( $str =~ m/^!(.+)/ ) {
print $1;
}
If you want to permanently change $str to not have an exclamation mark at the beginning, you need to use a substitution instead.
$str =~ s/^!//;
The s/// is the substitution operator. It changes $str in place. The original value including the ! will be lost.

Use ^!\K.+.
It works this way:
^! - Match initial ! (but this will soon change, see below).
\K - Keep - "forget" about what you have matched so far and set the starting point of the match here (after the !).
.+ - Match non-empty sequence of chars.
Due to \K, only the last part (.+) is actually matched.

Related

Regex to find(/replace) multiple instances of character in string

I have a (probably very basic) question about how to construct a (perl) regex, perl -pe 's///g;', that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/
For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by #, possibly with some characters in between.
[abc#def"ghi"jkl'123]
The following regex
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;
will find the first " and if I run it three times I have all three.
Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g;
does the trick for one instance. But how can I find all of them in one go?
The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.
In [abc#def"ghi"jkl'123], there is only a single match (which is the [abc#def" part of the string, with $1 = '[abc#def' and $2 = ''), so only the first " is removed.
After the first match, Perl scans the remaining string (ghi"jkl'123]) for another match, but it doesn't find another [ (or #).
I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.
In code:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;
Or to replace each match by X:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;
We match a prefix of [, followed by 0 or more characters that are not [ or ] or #, followed by #.
\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).
We match and capture 0 or more characters that are not [ or ].
Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).
The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr, respectively, but since each inner match is just a single character, it's also possible to use a transliteration.
We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):
The \K causes everything before it to be dropped....so only the character after it is actually replaced.
perl -pe 'while (s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//){}' file
Basically we just have an empty loop that executes until the search and replace replaces nothing.
Slightly improved version:
perl -pe 'while (s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file
The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the *s lazy with the ? so we get the shortest match possible.
Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that cleans it of non-alphanumeric characters
use warnings;
use strict;
use feature 'say';
my $var = q(ah [abc#def"ghi"jkl'123] oh); #'
say $var;
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
(my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
$v
}ex;
say $var;
where the lone $v is needed so to return that and not the number of matches, what s/ operator itself returns. This can be improved by using the /r modifier, which returns the changed string and doesn't change the original (so it doesn't attempt to change $1, what isn't allowed)
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
$1 =~ s/[^0-9a-zA-Z]//gr;
}ex;
The \K is there so that all matches before it are "dropped" -- they are not consumed so we don't need to capture them in order to put them back. The /e modifier makes the replacement part be evaluated as code.
The code in the question doesn't work because everything matched is consumed, and (under /g) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails and only that first occurrence is replaced.
The problem with matches that we want to leave in the string can often be remedied by \K (used in all current answers), which makes it so that all matches before it are not consumed.

Perl regex with grouping in LHS

How to can match the next lines?
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
want remove the - repetative.text from the end, but only if it repeats.
sometext_TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
my trying
use strictures;
my $text="sometext_TEXT1.xxx-TEXT1.xxx";
$text =~ s/(.*?)(.*)(\s*-\s*$2)/$1$2/;
print "$text\n";
prints
Use of uninitialized value $2 in regexp compilation at a line 3.
with other words, looking for better solution for the next split + match...
while(<DATA>) {
chomp;
my($first, $second) = split /\s*-\s*/;
s/\s*-\s*$second$// if ( $first =~ /$second$/ );
print "$_\n";
}
__DATA__
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
$text =~ s/(.*?)(.*)(\s*-\s*$2)/$1$2/;
This regex has various issues, but is on the right path.
Use \2 (or better: \g2 or \g{-1}) or something to reference the contents of a capture group. The $2 variable is interpolated when the Perl statement is executed. At that time, $2 is undefined, as there was no previous match. You get a warning as it is uninitialized. Even if it were defined, the pattern would be fixed during compilation.
You define three capture groups, but only need one. There is a trick with the \Keep directive: It let's the regex engine forget the previously matched text, so that it won't be affected by the substitution. That is, s/(foo)b/$1/ is equivalent to s/foo\Kb//. The effect is similar to a variable-length lookbehind.
The (.*?)(.*) part is a bit of an backtracking nightmare. We can reduce the cost of your match by adding further conditions, e.g. by anchoring the pattern at start and end of line. Using above modifications, we now have s/^.*?(.*)\K\s*-\s*\g1$//. But on second thought, we can just remove the ^.*? because this describes something the regex engine does anyway!
A short test:
while(<DATA>) {
s/(.*)\K\s*-\s*\g1$//;
print;
}
__DATA__
sometext_TEXT1.yyy-TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
Output:
sometext_TEXT1.yyy
anothertext_OTHER.yyy-MAX.yyy
A few words regarding your splitting solution: This will also shorten the line
sometext_TEXT1xyyy - 1.xyyy
because when you interpolate a variable into a regex, the contents aren't matched literally. Instead, they are interpreted as a pattern (where . matches any non-newline codepoint)! You can avoid this by quoting all metacharacters with the \Q...\E escape:
s/\s*-\s*\Q$second\E$// if $first =~ /\Q$second\E$/;
When you use $2 Perl will try to interpolate that variable, but the variable will only be set after the match has completed. What you want, is a backreference, for which you need to use \2:
$text =~ s/(.*?)(.*)(\s*-\s*\2)/$1$2/;
Note that, when the replacement part is evaluated, $1 and $2 have been set and can be interpolated as expected. Also you could make the pattern a bit more concise (and probably more efficient), by using:
$text =~ s/(.*)\s*-\s*\2/$1/;
There is no need to match the initial part (.*?) if it's arbitrary and you just write it back anyway. What you might want to do though, is anchor the pattern to the end of the string:
$text =~ s/(.*)\s*-\s*\1$/$1/;
Otherwise (with your initial attempt or mine), you'd turn something-thingelse into somethingelse.

Perl regex - why does the regex /[0-9\.]+(\,)/ match comma

The following seems to match ,
Can someone explain why?
I would like to match more than one Number or point, ended by comma.
123.456.768,
123,
.,
1.2,
But doing the following unexpectedly prints , too
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g){
print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
As defined here.
Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g
Your match in the foreach loop is in list context. In list context, a match returns what its captured. Parens indicate a capture, not the whole regex. You have parens around a comma. You want it the other way around, put the parens aroundt he bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000
If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g
You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g
You can replace the comma with a lookahead, or just exclude the comma altogether since it isn't part of what you want to capture, it won't make a difference in this case. However, the pattern as it is puts the comma instead of the number into capture group 1, and then doesn't even reference by capture group, returning the entire match instead.
This is how a capture group is retrieved:
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*)end/) {
print $1;
}

Replace specific capture group instead of entire regex in Perl

I've got a regular expression with capture groups that matches what I want in a broader context. I then take capture group $1 and use it for my needs. That's easy.
But how to use capture groups with s/// when I just want to replace the content of $1, not the entire regex, with my replacement?
For instance, if I do:
$str =~ s/prefix (something) suffix/42/
prefix and suffix are removed. Instead, I would like something to be replaced by 42, while keeping prefix and suffix intact.
As I understand, you can use look-ahead or look-behind that don't consume characters. Or save data in groups and only remove what you are looking for. Examples:
With look-ahead:
s/your_text(?=ahead_text)//;
Grouping data:
s/(your_text)(ahead_text)/$2/;
If you only need to replace one capture then using #LAST_MATCH_START and #LAST_MATCH_END (with use English; see perldoc perlvar) together with substr might be a viable choice:
use English qw(-no_match_vars);
$your_string =~ m/aaa (bbb) ccc/;
substr $your_string, $LAST_MATCH_START[1], $LAST_MATCH_END[1] - $LAST_MATCH_START[1], "new content";
# replaces "bbb" with "new content"
This is an old question but I found the below easier for replacing lines that start with >something to >something_else. Good for changing the headers for fasta sequences
while ($filelines=~ />(.*)\s/g){
unless ($1 =~ /else/i){
$filelines =~ s/($1)/$1\_else/;
}
}
I use something like this:
s/(?<=prefix)(group)(?=suffix)/$1 =~ s|text|rep|gr/e;
Example:
In the following text I want to normalize the whitespace but only after ::=:
some text := a b c d e ;
Which can be achieved with:
s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e
Results with:
some text := a b c d e ;
Explanation:
(?<=::=): Look-behind assertion to match ::=
(.*): Everything after ::=
$1 =~ s|\s+| |gr: With the captured group normalize whitespace. Note the r modifier which makes sure not to attempt to modify $1 which is read-only. Use a different sub delimiter (|) to not terminate the replacement expression.
/e: Treat the replacement text as a perl expression.
Use lookaround assertions. Quoting the documentation:
Lookaround assertions are zero-width patterns which match a specific pattern without including it in $&. Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.
If the beginning of the string has a fixed length, you can thus do:
s/(?<=prefix)(your capture)(?=suffix)/$1/
However, ?<= does not work for variable length patterns (starting from Perl 5.30, it accepts variable length patterns whose length is smaller than 255 characters, which enables the use of |, but still prevents the use of *). The work-around is to use \K instead of (?<=):
s/.*prefix\K(your capture)(?=suffix)/$1/

Is this line of Perl meaningless? s/^(\d+)\b/$1/sg

Does this line of Perl really do anything?
$variable =~ s/^(\d+)\b/$1/sg;
The only thing I can think of is that $1 or $& might be re-used, but it is immediately followed by.
$variable =~ s/\D//sg;
With these two lines together, is the first line meaningless and removable? It seems like it would be, but I have seen it multiple times in this old program, and wanted to make sure.
$variable =~ s/^(\d+)\b/$1/sg;
The anchor ^ at the beginning makes the /g modifier useless.
The lack of the wildcard character . in the string makes the /s modifier useless, since it serves to make . also match newline.
Since \b and ^ are zero-width assertions, and the only things outside the capture group, this substitution will not change the variable at all.
The only thing this regex does is capture the digits into $1, if they are found.
The subsequent regex
$variable =~ s/\D//sg;
Will remove all non-digits, making the variable just one long number. If one wanted to separate the first part (matched by the first regex), the only way to do so would be by accessing $1 from the first regex.
However, the first regex in that case would be better written simply:
$variable =~ /^(\d+)\b/;
And if the capture is supposed to be used:
my ($num) = $variable =~ /^(\d+)\b/;
Is "taint mode" in use? (Script is invoked with -T option.)
Maybe it's used to sanitize (i.e. untaint) user input.