Replace specific capture group instead of entire regex in Perl - regex

I've got a regular expression with capture groups that matches what I want in a broader context. I then take capture group $1 and use it for my needs. That's easy.
But how to use capture groups with s/// when I just want to replace the content of $1, not the entire regex, with my replacement?
For instance, if I do:
$str =~ s/prefix (something) suffix/42/
prefix and suffix are removed. Instead, I would like something to be replaced by 42, while keeping prefix and suffix intact.

As I understand, you can use look-ahead or look-behind that don't consume characters. Or save data in groups and only remove what you are looking for. Examples:
With look-ahead:
s/your_text(?=ahead_text)//;
Grouping data:
s/(your_text)(ahead_text)/$2/;

If you only need to replace one capture then using #LAST_MATCH_START and #LAST_MATCH_END (with use English; see perldoc perlvar) together with substr might be a viable choice:
use English qw(-no_match_vars);
$your_string =~ m/aaa (bbb) ccc/;
substr $your_string, $LAST_MATCH_START[1], $LAST_MATCH_END[1] - $LAST_MATCH_START[1], "new content";
# replaces "bbb" with "new content"

This is an old question but I found the below easier for replacing lines that start with >something to >something_else. Good for changing the headers for fasta sequences
while ($filelines=~ />(.*)\s/g){
unless ($1 =~ /else/i){
$filelines =~ s/($1)/$1\_else/;
}
}

I use something like this:
s/(?<=prefix)(group)(?=suffix)/$1 =~ s|text|rep|gr/e;
Example:
In the following text I want to normalize the whitespace but only after ::=:
some text := a b c d e ;
Which can be achieved with:
s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e
Results with:
some text := a b c d e ;
Explanation:
(?<=::=): Look-behind assertion to match ::=
(.*): Everything after ::=
$1 =~ s|\s+| |gr: With the captured group normalize whitespace. Note the r modifier which makes sure not to attempt to modify $1 which is read-only. Use a different sub delimiter (|) to not terminate the replacement expression.
/e: Treat the replacement text as a perl expression.

Use lookaround assertions. Quoting the documentation:
Lookaround assertions are zero-width patterns which match a specific pattern without including it in $&. Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.
If the beginning of the string has a fixed length, you can thus do:
s/(?<=prefix)(your capture)(?=suffix)/$1/
However, ?<= does not work for variable length patterns (starting from Perl 5.30, it accepts variable length patterns whose length is smaller than 255 characters, which enables the use of |, but still prevents the use of *). The work-around is to use \K instead of (?<=):
s/.*prefix\K(your capture)(?=suffix)/$1/

Related

Regex to find(/replace) multiple instances of character in string

I have a (probably very basic) question about how to construct a (perl) regex, perl -pe 's///g;', that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/
For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by #, possibly with some characters in between.
[abc#def"ghi"jkl'123]
The following regex
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;
will find the first " and if I run it three times I have all three.
Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g;
does the trick for one instance. But how can I find all of them in one go?
The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.
In [abc#def"ghi"jkl'123], there is only a single match (which is the [abc#def" part of the string, with $1 = '[abc#def' and $2 = ''), so only the first " is removed.
After the first match, Perl scans the remaining string (ghi"jkl'123]) for another match, but it doesn't find another [ (or #).
I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.
In code:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;
Or to replace each match by X:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;
We match a prefix of [, followed by 0 or more characters that are not [ or ] or #, followed by #.
\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).
We match and capture 0 or more characters that are not [ or ].
Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).
The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr, respectively, but since each inner match is just a single character, it's also possible to use a transliteration.
We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):
The \K causes everything before it to be dropped....so only the character after it is actually replaced.
perl -pe 'while (s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//){}' file
Basically we just have an empty loop that executes until the search and replace replaces nothing.
Slightly improved version:
perl -pe 'while (s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file
The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the *s lazy with the ? so we get the shortest match possible.
Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that cleans it of non-alphanumeric characters
use warnings;
use strict;
use feature 'say';
my $var = q(ah [abc#def"ghi"jkl'123] oh); #'
say $var;
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
(my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
$v
}ex;
say $var;
where the lone $v is needed so to return that and not the number of matches, what s/ operator itself returns. This can be improved by using the /r modifier, which returns the changed string and doesn't change the original (so it doesn't attempt to change $1, what isn't allowed)
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
$1 =~ s/[^0-9a-zA-Z]//gr;
}ex;
The \K is there so that all matches before it are "dropped" -- they are not consumed so we don't need to capture them in order to put them back. The /e modifier makes the replacement part be evaluated as code.
The code in the question doesn't work because everything matched is consumed, and (under /g) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails and only that first occurrence is replaced.
The problem with matches that we want to leave in the string can often be remedied by \K (used in all current answers), which makes it so that all matches before it are not consumed.

Extract certain part of a string in Perl

I have the following Perl strings. The lengths and the patterns are different. The file is always named *log.999
my $file1 = '/user/mike/desktop/sys/syslog.1';
my $file2 = '/user/mike/desktop/movie/dnslog.2';
my $file3 = '/haselog.3';
my $file4 = '/user/mike/desktop/movie/dns-sys.log'
I need to extract the words before log. In this case, sys, dns, hase and dns-sys.
How can I write a regular expression to extract them?
\w+(?=log\b)
matches one or more alphanumeric characters that are followed by log (but not logging etc.)
If the filename format is fixed, you can make the regex more reliable by using
\w+(?=log\.\d+\/$)
The main property of shown strings is that the *log* phrase is last.
Then anchor the pattern, so we wouldn't match a log somewhere in the middle
my ($name) = $string =~ /(\w+)log\.[0-9]+$/;
while if .N extension is optional
my ($name) = $string =~ /(\w+)log(?:\.[0-9]+)?$/;
The above uses the \w+ pattern to capture the text preceding log. But that text may also contain non-word characters (-, ., etc), in which case we would use [^/]+ to capture everything after the last /, as pointed out in Abigail's answer. With .N optional, per question in the comments
my ($name) = $string =~ m{ ([^/]+) log (?: \.[0-9]+ )? $}x;
where I added the }x modifier, with which spaces inside are ignored, what can aid readibility.
I use a set of delimiters other than / to be able to use / inside without escaping it, and then the m is compulsory. The [^...] is a negated character class, matching any character not listed inside. So [^/]+log matches all successive characters which are not /, coming before log.
The non capturing group (?: ... ) groups patterns inside, so that ? applies to the whole group, but doesn't needlessly capture them.
The (?:\.[0-9]+)? pattern was written specifically so to disallow things like log. (nothing after dot) and log5. But if these are acceptable, change it to the simpler \.?[0-9]*
Update Corrected a typo in code: for optional .N there is +, not *

Perl : Decoding Regex

I would highly appreciate if somebody could help me understand the following.
=~/(?<![\w.])($val)(?![\w.])/gi)
This what i picked up but i dont understand this.
Lookaround: (?=a) for a lookahead, ?! for negative lookahead, or ?<= and ?<! for lookbehinds (positive and negative, respectively).
The regex seems to search for $val (i.e. string that matches the contents of the variable $val) not surrounded by word characters or dots.
Putting $val into parentheses remembers the corresponding matched part in $1.
See perlre for details.
Note that =~ is not part of the regex, it's the "binding operator".
Similarly, gi) is part of something bigger. g means the matching happens globally, which has different effects based on the context the matching occurs in, and i makes the match case insensitive (which could only influence $val here). The whole expression was in parentheses, probably, but we can't see the opening one.
Read (?<!PAT) as "not immediately preceded by text matching PAT".
Read (?!PAT) as "not immediately followed by text matching PAT".
I use these sites to help with testing and learning and decoding regex:
https://regex101.com/: This one dissects and explains the expression the best IMO.
http://www.regexr.com/
define $val then watch the regex engine work with rxrx - command-line REPL and wrapper for Regexp::Debugger
it shows output like this but in color
Matched
|
VVV
/(?<![\w.])(dog)(?![\w.])/
|
V
'The quick brown fox jumps over the lazy dog'
^^^
[Visual of regex at 'rxrx' line 0] [step: 189]
It also gives descriptions like this
(?<! # Match negative lookbehind
[\w.] # Match any of the listed characters
) # The end of negative lookbehind
( # The start of a capturing block ($1)
dog # Match a literal sequence ("dog")
) # The end of $1
(?! # Match negative lookahead
[\w.] # Match any of the listed characters
) # The end of negative lookahead

What do these qr{} regular expressions mean?

What do these mean?
qr{^\Q$1\E[a-zA-Z0-9_\-]*\Q$2\E$}i
qr{^[a-zA-Z0-9_\-]*\Q$1\E$}i
If $pattern is a Perl regular expression, what is $identity in the code below?
$identity =~ $pattern;
When the RHS of =~ isn't m//, s/// or tr///, a match operator (m//) is implied.
$identity =~ $pattern;
is the same as
$identity =~ /$pattern/;
It matches the pattern or pre-compiled regex $pattern (qr//) against the value of $identity.
The binding operator =~ applies a regex to a string variable. This is documented in perldoc perlop
The \Q ... \E escape sequence is a way to quote meta characters (also documented in perlop). It allows for variable interpolation, though, which is why you can use it here with $1 and $2. However, using those variables inside a regex is somewhat iffy, because they themselves are defined during the use of a capture inside a regex.
The character class bracket [ ... ] defines a range of characters which it will match. The quantifier that follows it * means that particular bracket must match zero or more times. The dashes denote ranges, such as a-z meaning "from a through z". The escaped dash \- means a literal dash.
The ^ and $ (the dollar sign at the end) denotes anchors, beginning and end of string respectively. The modifier i at the end means the match is case insensitive.
In your example, $identity is a variable that presumably contains a string (or whatever it contains will be converted to a string).
The perlre documentation is your friend here. Search it for unfamiliar regex constructs.
A detailed explanation is below, but it is so hairy that I wonder whether using a module such as Text::Balanced would be a superior approach.
The first pattern matches possibly empty delimited strings, and the delimiters are in $1 and $2, which we do not know until runtime. Say $1 is ( and $2 is ), then the first pattern matches strings of the form
()
(a)
(9)
(abcABC_012-)
and so on …
The second pattern matches terminated strings, where the terminator is in $1—also not known until runtime. Assuming the terminator is ], then the second pattern matches strings of the form
]
a]
Aa9a_9]
Using \Q...\E around a pattern removes any special regex meaning from the characters inside, as documented in perlop:
For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed. This allows the pattern to match literally (except for $ and #). For example, the following matches:
'\s\t' =~ /\Q\s\t/
Because $ or # trigger interpolation, you'll need to use something like /\Quser\E\#\Qhost/ to match them literally.
The patterns in your question do want to trigger interpolation but do not want any regex metacharacters to have special meaning, as with parentheses and square brackets above that are meant to match literally.
Other parts:
Circumscribed brackets delimit a character class. For example, [a-zA-Z0-9_\-] matches any single character that is upper- or lowercase A through Z (but with no accents or other extras), zero through nine, underscore, or hyphen. Note that the hyphen is escaped at the end to emphasize that it matches a literal hyphen rather and does not specify part of a range.
The * quantifier means match zero or more of the preceding subpattern. In the examples from your question, the star repeats character classes.
The patterns are bracketed with ^ and $, which means an entire string must match rather than some substring to succeed.
The i at the end, after the closing curly brace, is a regex switch that makes the pattern case-insensitive. As TLP helpfully points out in the comment below, this makes the delimiters or terminators match without regard to case if they contain letters.
The expression $identity =~ $pattern tests whether the compiled regex stored in $pattern (created with $pattern = qr{...}) matches the text in $identity. As written above, it is likely being evaluated for its side effect of storing capture groups in $1, $2, etc. This is a red flag. Never use $1 and friends unconditionally but instead write
if ($identity =~ $pattern) {
print $1, "\n"; # for example
}

extract word with regular expression

I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.
I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!
Why?
Your whole match evaluates to '1/temperatoA,2/CelcieusB' because that matches the following expression:
qr{ ( # begin group
\d+ # at least one digit
/ # followed by a slash
(\w+) # followed by at least one word characters
,? # maybe a comma
)* # ANY number of repetitions of this pattern.
}x;
'1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to capture as many of those as it can it goes back and finds that the pattern is repeated in '2/CelcieusB' (the comma not being necessary). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'.
Anytime you want to capture anything that fits a certain pattern in a certain string it is always best to use the global flag and assign the captures into an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.
When I do this:
my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
if ( my #matches = $str =~ /$regex/g ) {
print Dumper( \#matches );
}
I get this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB',
'23/33',
'33',
'55/66',
'66'
];
Now, I figure that's probably not what you expected. But '3' and '6' are word characters, and so--coming after a slash--they comply with the expression.
So, if this is an issue, you can change your regex to the equivalent: qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be an alpha followed by any number of word characters. Then the dump looks like this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB'
];
And if you only want 'temperatoA' or 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.
However, the secret to capturing more than one chunk in a capture expression is to assign the match to an array, you can then sort through the array to see if it contains the data you want.
The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?
The expression you want is simply as follows:
(\w+)
With a Perl-compatible regex engine you can search for
(?<=\d/)\w+(?=.*!)
(?<=\d/) asserts that there is a digit and a slash before the start of the match
\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.
(?=.*!) asserts that there is a ! ahead in the string - i. e. the regex will fail once we have passed the !.
Depending on the language you're using, you might need to escape some of the characters in the regex.
E. g., for use in C (with the PCRE library), you need to escape the backslashes:
myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
Will this work?
/([[:alpha:]]\w+)\b(?=.*!)
I made the following assumptions...
A word begins with an alphabetic character.
A word always immediately follows a slash. No intervening spaces, no words in the middle.
Words after the exclamation point are ignored.
You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.
[[:alpha:]] matches any alphabetic character.
The \b matches a word boundary.
And the (?=.*!) came from Tim Pietzcker's post.