Result of "regsub -all -- (a+)(ba*) aabaabxab {z\2} x" - regex

What's the meaning of {z\2} in this regular expression, especially \2?
regsub -all -- (a+)(ba*) aabaabxab {z\2} x
I got this result:
(bin) 58 % regsub -all -- (a+)(ba*) aabaabxab {z\2} x
2
(bin) 59 % puts $x
zbaabxzb
How does the expression match aabaabxab?

{z\2} tells your command to substitute all matches with z and the content of the second capturing group (\2).
The whole expression itself doesn't match aabaabxab!
What it matches are aabaa and ab at the end:
aabaabxab
^^^^^  ^^
Why? Because (a+)(ba*) means:
one or more a characters
(a+)
followed by a b
(b)
optionally followed by more a characters (zero or more)
(a*)
So, the first match will be aabaa (seen from the beginning of the string). Now aa is the content of the first capturing group, baa is the content of the second. You now replace aabaa with z\2, thus zbaa.
See if you can figure it out yourself for the second match. :-)
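To check the substitution outside Tcl, here is a minimal sketch in Python. It assumes that Python's re.finditer and re.subn treat this pattern the same way regsub -all does: every non-overlapping match is replaced, and \2 refers to the second capturing group of each match. The variable names are only illustrative.

```python
import re

# Same pattern and input as the Tcl command above.
pattern = r"(a+)(ba*)"
text = "aabaabxab"

# Collect (whole match, group 1, group 2) for every non-overlapping match.
matches = [(m.group(0), m.group(1), m.group(2)) for m in re.finditer(pattern, text)]

# re.subn returns the new string and the number of replacements,
# which correspond to the value stored in x and the count printed by regsub.
result, count = re.subn(pattern, r"z\2", text)

print(matches)
print(result, count)
```

Printing matches shows what each capturing group held for both matches, which is exactly what \2 picks from.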

Related

Regex for matching a specific pattern only if it doesn't match other pattern

I need to create a matching regex to find genetic sequences and I got stuck behind one specific problem - after first, start codon ATG, follows other codons from three nucleotides as well and the regex ends with three possible codons TAA, TAG and TGA. What if the stop(end) codon goes after the start(ATG) codon? My current regex works when there are intermediate codons between start and stop codon, but if there are none, the regex matches ALL of the sequence after start codon. I know why it does that, but I have no idea how to change it to work the way I want it to.
My regex should look for AGGAGG (exactly this pattern), then A, C, G or T (from 4 to 12 times) then ATG (exactly this pattern), then A, C, G or T (in triples (for example, ACG, TGC and etc.), doesn't matter how long) UNTIL it matches TAA, TAG or TGA. The search should end after that and start again after that.
Example of a good match:
XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX
AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT
There are two matches in the sequence - from 0 to 25 and from 28 to 44.
My current regex(don't mind the first two brackets):
$seq =~ /(AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3,3}){0,}(TAA|TAG|TGA)/ig
The problem here comes from the default use of greedy quantifiers.
With (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*(TAA|TAG|TGA), the 4th group ([ACTG]{3})* matches as many triples as possible; only then is the 5th group considered (backtracking if needed).
Your sequence contains TAGTAG. With the greedy quantifier, the first TAG is captured in group 4 and the second one is captured as the ending group.
You may use a lazy quantifier instead: (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*?(TAA|TAG|TGA) (note the added question mark, which makes the quantifier lazy).
That way, the first TAG encountered is treated as the ending group.
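The difference between the two quantifiers can be seen by comparing match offsets. The following is a sketch in Python rather than Perl, on the assumption that greedy and lazy quantifiers behave the same way in both engines for this pattern. The names greedy, lazy, greedy_spans and lazy_spans are illustrative.

```python
import re

# Sequence from the question.
seq = "AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT"

# Greedy: ([ACGT]{3})* grabs as many triples as it can before a stop codon is tried.
greedy = re.compile(r"(AGGAGG)([ACGT]{4,12})(ATG)([ACGT]{3})*(TAA|TAG|TGA)", re.IGNORECASE)

# Lazy: ([ACGT]{3})*? takes as few triples as possible, so the first stop codon wins.
lazy = re.compile(r"(AGGAGG)([ACGT]{4,12})(ATG)([ACGT]{3})*?(TAA|TAG|TGA)", re.IGNORECASE)

# (start, end) offsets of every non-overlapping match.
greedy_spans = [m.span() for m in greedy.finditer(seq)]
lazy_spans = [m.span() for m in lazy.finditer(seq)]

print("greedy:", greedy_spans)
print("lazy:  ", lazy_spans)
```

The lazy spans are the "0 to 25" and "28 to 44" ranges the question asks for; the greedy spans each run one stop codon too far.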
According to the pattern you gave, you could have overlapping matches. The following will find all matches, including overlapping matches:
local our @matches;
$seq =~ /
   (
      ( AGGAGG )
      ( [ACGT]{4,12} )
      ( ATG )
      ( (?: (?! TAA|TAG|TGA ) [ACTG]{3} )* )
      ( TAA|TAG|TGA )
   )
   (?{ push @matches, [ $-[1], $1, $2, $3, $4, $5, $6 ] })  # record the offset and the captures
   (?!)                                                      # always fail, forcing the engine to try the next candidate
/xg;
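A related idea can be sketched in Python, which has no (?{ ... }) code blocks. Wrapping the whole pattern in a zero-width lookahead lets the scan advance one character at a time, so matches that overlap are still reported. This is a simplification of the Perl above: it yields at most one match per starting position. The test sequence is a made-up example, not data from the question.

```python
import re

# Hypothetical sequence in which three candidate matches overlap:
# AGGAGG occurs at offsets 0, 3 and 6, all followed by the same ATG...TAA.
seq = "AGGAGGAGGAGGTTTTATGTAA"

# The outer (?=( ... )) consumes nothing, so finditer restarts at every offset.
# (?!TAA|TAG|TGA)[ACGT]{3} accepts a triple only if it is not a stop codon.
pattern = re.compile(
    r"(?=(AGGAGG[ACGT]{4,12}ATG(?:(?!TAA|TAG|TGA)[ACGT]{3})*(?:TAA|TAG|TGA)))"
)

# (start offset, matched text) for every starting position that works.
found = [(m.start(), m.group(1)) for m in pattern.finditer(seq)]

for start, text in found:
    print(start, text)
```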
An essential feature of Perl regexes, as opposed to plain regexes like those in grep, is the lazy quantifier: a ? following the * or + quantifier. It matches zero or more (for *) or one or more (for +) occurrences of the preceding token, but as few as possible.
$seq =~ /((AGGAGG)([ACGT]{4,12})(ATG)([ACGT]{3})*?(TAA|TAG|TGA))/igx

Match reverse translation of a capture group in Perl regex

I am trying to find strings that match a certain pattern, followed by the letter O, followed by the reverse translation of that pattern.
Translation rule is /ABC/XYZ.
Example of a match: CCBAOXYZZ
First section matches the pattern [ABC]{3,25}. Then there's a letter O which also matches. Then we see that XYZZ is the reverse of CCBA with the translation above applied.
I have managed to write the tr rule into my backreferencing. But I cannot figure out how to do the reverse as well.
while (my $input_string = <sample_input>) {
push @hit, $1 while $input_string
=~ m{
(([ABC]{3,25})
O
(??{ $2 =~ tr/ABC/XYZ/r}))
}xg;
}
Is it correct to add the 'reverse' function to the third line of the regex in this way: (??{ $2 =~ tr/ACGT/TGCA/r;reverse}))?
How do I match the reverse tr of $2?
Your tr///r returns the transliterated string. So you simply need to stick your reverse in front of the tr///r and you're good to go.
push @hit, $1 while $input_string
=~ m{
(([ABC]{3,25})
O
(??{ reverse $2 =~ tr/ABC/XYZ/r }))
}xg;
The return value of the tr///r does not go into $_, so ; reverse will reverse whatever is in $_. That makes the overall match fail.
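The order of operations (transliterate first, then reverse, then compare with the text after the O) can be sketched in Python. Python has no (??{ ... }) construct, so this sketch does the comparison after the regex match instead of inside it; it is an approximation of the Perl, not an equivalent. The function name find_hits and the table TABLE are hypothetical.

```python
import re

# Translation rule from the question: A->X, B->Y, C->Z.
TABLE = str.maketrans("ABC", "XYZ")

def find_hits(text):
    """Return every 'left O right' hit where right is the reversed translation of left."""
    hits = []
    for m in re.finditer(r"([ABC]{3,25})O", text):
        # Transliterate the left side, then reverse it (the equivalent of reverse tr///r).
        expected = m.group(1).translate(TABLE)[::-1]
        # The text right after the O must be exactly that string.
        if text.startswith(expected, m.end()):
            hits.append(m.group(0) + expected)
    return hits

hits = find_hits("CCBAOXYZZ")
misses = find_hits("CCBAOZZYX")  # translated but not reversed, so no hit
print(hits, misses)
```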
You actually answered your own question in your last sentence.
How do I match the reverse tr of $2?
If you add use re 'debug' you can see the actual pattern that is being matched against.
With tr///; reverse, the second part of that debugging output, which relates to the regex compiled from the eval, is:
...
Compiling REx "ZZYXOABCC"
Final program:
1: EXACT <ZZYXOABCC> (5)
5: END (0)
anchored "ZZYXOABCC" at 0 (checking anchored isall) minlen 9
Matching embedded REx "ZZYXOABCC" against "XYZZ"
...
As we can see here, the pattern it tried to match after the O is the reverse of the entire input string (CCBAOXYZZ reversed is ZZYXOABCC), not the transliterated and reversed left side. That is because reverse operated on $_ rather than on the result of the tr.
Now if we compare that to reverse tr///r, we see the difference.
...
Compiling REx "XYZZ"
Final program:
1: EXACT <XYZZ> (3)
3: END (0)
anchored "XYZZ" at 0 (checking anchored isall) minlen 4
Matching embedded REx "XYZZ" against "XYZZ"
...
It now only returns the transliterated left side of the string, which then matches.

Why is bracket mandatory here?

1 . ^([0-9A-Za-z]{5})+$
vs
2 . ^[a-zA-Z0-9]{5}+$
My intention is to match any string of length n such that n is a multiple of 5.
Check here : https://regex101.com/r/sS6rW8/1.
Please elaborate on why case 1 matches the string whereas case 2 does not.
Because {n}+ doesn't mean what you think it does. In PCRE syntax, this turns {n} into a possessive quantifier. In other words, a{5}+ is the same as (?>a{5}). It's like the second + in the expression a++, which is the same as using an atomic group (?>a+).
This has no use with a fixed-length {n} but is more meaningful when used with {min,max}. So, a{2,5}+ is equivalent to (?>a{2,5}).
As a simple example, consider these patterns:
^(a{1,2})(ab) will match aab -> $1 is "a", $2 is "ab"
^(a{1,2}+)(ab) won't match aab -> $1 consumes "aa" possessively and $2 can't match
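The same contrast can be reproduced in Python. Possessive quantifiers are only available in newer Python versions, so this sketch assumes the usual emulation instead: a lookahead captures the text and a backreference consumes it, and because a lookahead is never re-entered once it has succeeded, the captured text cannot be given back. The variable names are illustrative.

```python
import re

# Ordinary greedy quantifier: a{1,2} first takes "aa", then gives one "a" back
# so that (ab) can match.
backtracking = re.match(r"^(a{1,2})(ab)", "aab")

# Emulated possessive a{1,2}+: the lookahead captures "aa", \1 consumes it,
# and nothing can be given back, so (ab) has nothing to match.
possessive_like = re.match(r"^(?=(a{1,2}))\1(ab)", "aab")

# The two patterns from the question, tried on a 10-character string.
grouped = re.match(r"^([0-9A-Za-z]{5})+$", "abcde12345")
possessive_like_5 = re.match(r"^(?=([a-zA-Z0-9]{5}))\1$", "abcde12345")

print(backtracking.groups())
print(possessive_like)
print(grouped is not None, possessive_like_5 is not None)
```

The emulated {5}+ pattern accepts exactly five characters and nothing longer, which is why case 2 in the question rejects strings of length 10 or 15.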
In ^([0-9A-Za-z]{5})+$ you're saying: any 5 letters or digits, repeated one or more times. The + applies to the entire group (whatever is inside the parentheses) and the {5} applies to [0-9A-Za-z].
Your second example has a no-backtrack (possessive) quantifier, {5}+, which is different from (stuff{5})+.

Perl regex and capturing groups

The following prints ac | a | bbb | c
#!/usr/bin/env perl
use strict;
use warnings;
# use re 'debug';
my $str = 'aacbbbcac';
if ($str =~ m/((a+)?(b+)?(c))*/) {
print "$1 | $2 | $3 | $4\n";
}
It seems like failed matches do not reset the captured group variables.
What am I missing?
it seems like failed matches dont reset the captured group variables
There are no failed matches in there. Your regex matches the string fine, although some inner groups fail to match in some repetitions. Each group's value may be overwritten by the next match found for that particular group, or it keeps its value from a previous repetition if that group does not match in the current one.
Let's see how regex match proceeds:
First (a+)?(b+)?(c) matches aac. Since (b+)? is optional, that will not be matched. At this stage, each capture group contains following part:
$1 contains entire match - aac
$2 contains (a+)? part - aa
$3 contains (b+)? part - null.
$4 contains (c) part - c
There is still some string left to match: bbbcac. Proceeding further, (a+)?(b+)?(c) matches bbbc. Since (a+)? is optional, it is not matched this time.
$1 contains entire match - bbbc. Overwrites the previous value in $1
$2 doesn't match. So, it will contain text previously matched - aa
$3 this time matches. It contains - bbb
$4 matches c
Again, (a+)?(b+)?(c) will go on to match the last part - ac.
$1 contains entire match - ac.
$2 matches a this time. Overwrites the previous value in $2. It now contains - a
$3 doesn't match this time, as there is no b for (b+)? to match. It keeps the value from the previous match - bbb
$4 matches c. Overwrites the value from previous match. It now contains - c.
Now, there is nothing left in the string to match. The final value of all the capture groups are:
$1 - ac
$2 - a
$3 - bbb
$4 - c.
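The walk-through above can be replayed step by step. The sketch below is in Python and does not rely on how any engine handles captures inside a repeated group; instead it applies the inner pattern (a+)?(b+)?(c) once per repetition and overwrites a stored value only when that group actually took part in the match, which is the rule described above. The dictionary captures is an illustrative stand-in for $1 to $4.

```python
import re

text = "aacbbbcac"
inner = re.compile(r"(a+)?(b+)?(c)")

# Stand-in for $1..$4 of the outer pattern ((a+)?(b+)?(c))*.
captures = {1: None, 2: None, 3: None, 4: None}

for m in inner.finditer(text):
    # $1 is the whole repetition and is overwritten every time.
    captures[1] = m.group(0)
    # $2..$4 are overwritten only if the group matched in this repetition;
    # otherwise the value from an earlier repetition is kept.
    for inner_idx in (1, 2, 3):
        if m.group(inner_idx) is not None:
            captures[inner_idx + 1] = m.group(inner_idx)
    print(m.group(0), dict(captures))

final = " | ".join(captures[i] for i in (1, 2, 3, 4))
print(final)
```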
As odd as it seems this is the "expected" behavior. Here's a quote from the perlre docs:
NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.
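For parenthesized groups such as /(\d+)/, the documentation says to refer back to them inside the pattern with \1, \2, ... or \g{1}, \g{2}. Writing $1 or $2 inside the pattern part of a substitution does not refer to the current match. In the replacement part it is the other way around: $1 is the idiomatic form, and \1 is accepted there but draws a warning under use warnings.
# Example to turn a css href to local css.
# Transforms <link href="http://..." into <link href="css/..."
# ... inside a loop ...
my $localcss = $_; # one line from the file
$localcss =~ s/href.+\/([^\/]+\.css")/href="css\/$1/g ; # $1 is the file name captured by the pattern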

What does this regular expression try to match?

I have been learning regular expressions lately, but they still seem a little hard to me. I am reading some TCL code; what does this expression want to match?
regexp ".* (\[\\d]\{3\}:\[\\d]\{3\}:\[\\d]\{3\}.\[\\d]\{5\}).\[^\\n]" $input
If you un-escape the characters, you get the following:
.* ([\d]{3}:[\d]{3}:[\d]{3}.[\d]{5}).[^\n]
The term [\d]{x} would match x number of consecutive digits. Therefore, the portion inside the parentheses would match something of the form ###:###:###?##### (where # can be any digit and ? can be any character). The parentheses themselves aren't matched, they're just used for specifying what part of the input to "capture" and return to the caller. Following this sequence is a single dot ., which matches a single character (which can be anything). The trailing [^\n] will match a single character that is anything except a newline (a ^ at the start of a bracketed expression inverts the match). The .* term at the very beginning matches a sequence of characters of any length (even zero), followed by a space.
With all of this taken into account, it appears that this regular expression extracts a series of digits from the middle of a line. Given the format of the numbers, it may be looking for a timestamp in the hours:minutes:seconds.milliseconds format (although if that is the case, {1,3} and {1,5} should be used instead). The trailing .[^\n] term looks like it could be trying to exclude timestamps that are at or near the end of a line. Timestamped logs often have a timestamp followed by some sort of delimiting character (:, >, a space, etc). A regular expression like this might be used to extract timestamps from the log while ignoring "blank" lines that have a timestamp but no message.
Update:
Here's an example using TCL 8.4:
% set re ".* (\[\\d]\{3\}:\[\\d]\{3\}:\[\\d]\{3\}.\[\\d]\{5\}).\[^\\n]"
% regexp $re "TEST: 123:456:789:12345> sample log line"
1
% regexp $re " 111:222:333.44444 foo"
1
% regexp $re "111:222:333.44444 foo"
0
% regexp $re " 111:222:333.44444 "
0
% regexp $re " 10:44:56.12344: "
0
%
% regexp $re "TEST: 123:456:789:12345> sample log line" match data
1
% puts $match
TEST: 123:456:789:12345>
% puts $data
123:456:789:12345
The first two examples match the expression. The third fails because it lacks the space character before the first number sequence. The fourth fails because it doesn't have a non-newline character at the end after the trailing space. The fifth fails because the numerical sequences don't have enough digits. By passing parameters after the input, you can store the part of the input that matched the expression as well as the data that was "captured" by using parentheses. See the TCL wiki for details on the regexp command.
The interesting part with TCL is that you have to escape the [ character but not the ], while both the { and } need escaping.
.* ==> match junk part of the input
( ==> start capture
\[\\d]\{3\}: ==> match 3 digits followed by ':'
\[\\d]\{3\}: ==> match 3 digits followed by ':'
\[\\d]\{3\}. ==> match 3 digits followed by any character
\[\\d]\{5\} ==> match 5 digits
). ==> close capture and match any character
\[^\\n] ==> match a character that is not a newline
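The un-escaped expression can also be tried outside TCL. This is a sketch in Python, assuming its re.search behaves like the unanchored TCL regexp command for this pattern; the sample strings are the ones from the TCL session above, and the names samples and results are illustrative.

```python
import re

# The expression after removing the TCL escaping.
pattern = re.compile(r".* ([\d]{3}:[\d]{3}:[\d]{3}.[\d]{5}).[^\n]")

samples = [
    "TEST: 123:456:789:12345> sample log line",
    " 111:222:333.44444 foo",
    "111:222:333.44444 foo",   # no space before the digits
    " 111:222:333.44444 ",     # nothing after the trailing space
    " 10:44:56.12344: ",       # too few digits per field
]

# True where the expression is found somewhere in the sample, like regexp returning 1.
results = [pattern.search(s) is not None for s in samples]

# group(0) is the whole match, group(1) is the parenthesized capture.
first = pattern.search(samples[0])
whole_match, data = first.group(0), first.group(1)

print(results)
print(repr(whole_match), repr(data))
```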