Match reverse translation of a capture group in Perl regex - regex

I am trying to find strings that consist of a certain pattern, then the letter O, then the reverse of that pattern's translation.
Translation rule is /ABC/XYZ.
Example of a match: CCBAOXYZZ
First section matches the pattern [ABC]{3,25}. Then there's a letter O which also matches. Then we see that XYZZ is the reverse of CCBA with the translation above applied.
I have managed to write the tr rule into my backreferencing. But I cannot figure out how to do the reverse as well.
while (my $input_string = <sample_input>) {
    push @hit, $1 while $input_string
        =~ m{
            (([ABC]{3,25})
            O
            (??{ $2 =~ tr/ABC/XYZ/r}))
        }xg;
}
Is it correct to add the 'reverse' function to the third line of the regex in this way: (??{ $2 =~ tr/ACGT/TGCA/r;reverse}))?
How do I match the reverse tr of $2?

Your tr///r returns the transliterated string. So you simply need to stick your reverse in front of the tr///r and you're good to go.
push @hit, $1 while $input_string
    =~ m{
        (([ABC]{3,25})
        O
        (??{ reverse $2 =~ tr/ABC/XYZ/r }))
    }xg;
tr///r returns its result rather than storing it in $_, so with tr///r; reverse the result is discarded and the bare reverse operates on whatever is in $_, which here is the whole input string. That makes the overall match fail.
You actually answered your own question in your last sentence:
How do I match the reverse tr of $2?
If you add use re 'debug' you can see the actual pattern that is being matched against.
With tr///; reverse, the second part of that debugging output, which relates to the regex compiled from the eval, is:
...
Compiling REx "ZZYXOABCC"
Final program:
1: EXACT <ZZYXOABCC> (5)
5: END (0)
anchored "ZZYXOABCC" at 0 (checking anchored isall) minlen 9
Matching embedded REx "ZZYXOABCC" against "XYZZ"
...
As we can see here, the pattern built for the part after the O is the entire input string reversed, not just the left side. The bare reverse operated on $_ (the full string) rather than on the transliterated $2.
Now if we compare that to reverse tr///r, we see the difference.
...
Compiling REx "XYZZ"
Final program:
1: EXACT <XYZZ> (3)
3: END (0)
anchored "XYZZ" at 0 (checking anchored isall) minlen 4
Matching embedded REx "XYZZ" against "XYZZ"
...
The eval now returns only the reversed, transliterated left side of the string, which then matches.
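As a self-contained check of the working form, here is a minimal sketch that runs the pattern against the example string from the question plus one string that should not match. The second string and the @inputs array are made up for illustration. The explicit scalar in front of reverse is an addition that forces string reversal whatever context the code block is evaluated in. tr///r needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# First entry is the example from the question; the second is a made-up
# string whose right side is transliterated but not reversed.
my @inputs = ('CCBAOXYZZ', 'CCBAOZZYX');

my @hit;
for my $input_string (@inputs) {
    push @hit, $1 while $input_string =~ m{
        ( ([ABC]{3,25})
          O
          (??{ scalar reverse $2 =~ tr/ABC/XYZ/r })
        )
    }xg;
}

print "@hit\n";    # prints "CCBAOXYZZ"
```

Only the first string is collected, because XYZZ is the reverse of ZZYX, the transliteration of CCBA.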

Related

Regex for matching a specific pattern only if it doesn't match other pattern

I need to create a matching regex to find genetic sequences and I got stuck behind one specific problem - after first, start codon ATG, follows other codons from three nucleotides as well and the regex ends with three possible codons TAA, TAG and TGA. What if the stop(end) codon goes after the start(ATG) codon? My current regex works when there are intermediate codons between start and stop codon, but if there are none, the regex matches ALL of the sequence after start codon. I know why it does that, but I have no idea how to change it to work the way I want it to.
My regex should look for AGGAGG (exactly this pattern), then A, C, G or T (from 4 to 12 times) then ATG (exactly this pattern), then A, C, G or T (in triples (for example, ACG, TGC and etc.), doesn't matter how long) UNTIL it matches TAA, TAG or TGA. The search should end after that and start again after that.
Example of a good match:
XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX
AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT
There are two matches in the sequence - from 0 to 25 and from 28 to 44.
My current regex(don't mind the first two brackets):
$seq =~ /(AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3,3}){0,}(TAA|TAG|TGA)/ig
The problem here comes from the default use of greedy quantifiers.
When using (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*(TAA|TAG|TGA), the 4th group ([ACTG]{3})* matches as many codons as possible, and only then is the 5th group considered (backtracking if needed).
In your sequence you have TAGTAG. The greedy quantifier leads to the first TAG being captured in group 4 and the second one being captured as the ending group.
You may use a lazy quantifier instead: (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*?(TAA|TAG|TGA) (note the added question mark, which makes the quantifier lazy).
That way, the first TAG encountered will be treated as the ending group.
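To compare the two quantifiers on the sequence from the question, here is a small sketch that records the start and end offsets of every match. The helper name spans is hypothetical, and the offsets come from Perl's @- and @+ arrays.

```perl
use strict;
use warnings;

my $seq = 'AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT';

# Hypothetical helper: "start-end" offsets of each non-overlapping match.
sub spans {
    my ($re) = @_;
    my @spans;
    while ($seq =~ /$re/g) {
        push @spans, join '-', $-[0], $+[0];
    }
    return @spans;
}

my @greedy = spans(qr/(AGGAGG)([ACGT]{4,12})(ATG)([ACGT]{3})*(TAA|TAG|TGA)/i);
my @lazy   = spans(qr/(AGGAGG)([ACGT]{4,12})(ATG)([ACGT]{3})*?(TAA|TAG|TGA)/i);

print "greedy: @greedy\n";    # greedy: 0-28 28-47
print "lazy:   @lazy\n";      # lazy:   0-25 28-44
```

The lazy version stops at the first stop codon and reproduces the two ranges given in the question (0 to 25 and 28 to 44). The greedy version runs on to the second TAG in both cases.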
According to the pattern you gave, you could have overlapping matches. The following will find all matches, including overlapping matches:
local our @matches;
$seq =~ /
    (
        ( AGGAGG )
        ( [ACGT]{4,12} )
        ( ATG )
        ( (?: (?! TAA|TAG|TGA ) [ACTG]{3} )* )
        ( TAA|TAG|TGA )
    )
    (?{ push @matches, [ $-[1], $1, $2, $3, $4, $5, $6 ] })
    (?!)
/xg;
An essential feature of Perl regexes, as opposed to the basic regexes used by tools like grep, is the lazy quantifier: a ? following the * or + quantifier. It matches zero or more (for *) or one or more (for +) occurrences of the preceding token, taking the shortest match possible.
$seq =~ /((AGGAGG)([ACGT]{4,12})(ATG)([ACGT]{3})*?(TAA|TAG|TGA))/igx

Perl regex and capturing groups

The following prints ac | a | bbb | c
#!/usr/bin/env perl
use strict;
use warnings;
# use re 'debug';
my $str = 'aacbbbcac';
if ($str =~ m/((a+)?(b+)?(c))*/) {
print "$1 | $2 | $3 | $4\n";
}
It seems like failed matches do not reset the captured group variables.
What am I missing?
It seems like failed matches do not reset the captured group variables.
There are no failed matches in there. Your regex matches the string fine, although some inner groups fail to match in some repetitions. Each group's value may be overwritten by the next match found for that particular group, or keep its value from a previous repetition if that group is not matched in the current one.
Let's see how regex match proceeds:
First, (a+)?(b+)?(c) matches aac. Since (b+)? is optional, it is not matched. At this stage, the capture groups contain the following:
$1 contains the entire match - aac
$2 contains the (a+)? part - aa
$3 contains the (b+)? part - nothing (undefined)
$4 contains the (c) part - c
There is still some string left to match - bbbcac. Proceeding further, (a+)?(b+)?(c) matches bbbc. Since (a+)? is optional, it is not matched.
$1 contains entire match - bbbc. Overwrites the previous value in $1
$2 doesn't match. So, it will contain text previously matched - aa
$3 this time matches. It contains - bbb
$4 matches c
Again, (a+)?(b+)?(c) will go on to match the last part - ac.
$1 contains entire match - ac.
$2 matches a this time. Overwrites the previous value in $2. It now contains - a
$3 doesn't match this time, as there is no b in this part. It keeps the value from the previous match - bbb
$4 matches c. Overwrites the value from previous match. It now contains - c.
Now there is nothing left in the string to match. The final values of the capture groups are:
$1 - ac
$2 - a
$3 - bbb
$4 - c.
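The walkthrough above can be checked by matching one repetition at a time. In this sketch, which is an illustration and not part of the original answer, the outer (...)* is dropped and the loop is driven with /g and \G, so each pass is a separate successful match and unmatched groups show up as undef instead of carrying over.

```perl
use strict;
use warnings;

my $str = 'aacbbbcac';
my @rows;

# \G anchors each attempt where the previous one ended, so every pass
# corresponds to one repetition of the original (...)* group.
while ($str =~ /\G((a+)?(b+)?(c))/g) {
    # Groups that did not take part in this match are undef here.
    push @rows, join ' | ', map { defined $_ ? $_ : 'undef' } $1, $2, $3, $4;
}

print "$_\n" for @rows;
# aac | aa | undef | c
# bbbc | undef | bbb | c
# ac | a | undef | c
```

Compared with the single match using (...)*, the per-pass values here are the ones the walkthrough lists before any carry-over.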
As odd as it seems this is the "expected" behavior. Here's a quote from the perlre docs:
NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.
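For parenthesized groups such as /(\d+)/, the perlre documentation says to use \1, \2 ... or \g{1}, \g{2} ... to refer back to a group from inside the pattern itself. Using $1 or $2 there does not work as a backreference, because it is interpolated before the match runs. In the replacement part of a substitution, $1, $2 ... are the right form (\1 still works there but draws a warning under use warnings).
# Example to turn a css href to local css.
# Transforms <link href="http://..." into <link href="css/..."
# ... inside a loop ...
my $localcss = $_; # one line from the file
$localcss =~ s/href.+\/([^\/]+\.css")/href="css\/$1/g ;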

Regex: Matching 4-Digits within words

I have a body of text I'm looking to pull repeat sets of 4-digit numbers out from.
For Example:
The first is 1234 2) The Second is 2098 3) The Third is 3213
Now I know I'm able to get the first set of digits out by simply using:
/\d{4}/
...returning 1234
But how do I match the second set of digits, or the third, and so on...?
Edit: How do I return 2098, or 3213?
You don't appear to have a proper answer to your question yet.
The solution is to use the /g modifier on your regex. In list context it will find all of the numbers in your string at once, like this
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my @numbers = $str =~ /\b \d{4} \b/gx;
print "@numbers\n";
output
1234 2098 3213
Or you can iterate through them, using scalar context in a while loop, like this
while ($str =~ /\b (\d{4}) \b/gx) {
my $number = $1;
print $number, "\n";
}
output
1234
2098
3213
I have added the \b patterns to the regex so that it only matches whole four-digit numbers and doesn't, for example, find 1234 in 1234567. The /x modifier just allows me to add spaces so that the pattern is more intelligible.
See http://perldoc.perl.org/perlre.html for discussion on the use of the 'g' modifier which will cause your regular expression to match ALL occurrences of its pattern, not just the first.
If you want a pattern that finds the $n'th 4-digit group, this seems to work:
$pat = "^(?:.*?\\b(\\d{4})\\b){$n}";
if ($s =~ /$pat/) {
print "Found $1\n";
} else {
print "Not found\n";
}
I did this by building a string pattern because I couldn't get a variable interpolated into a quantifier {$n}.
This pattern finds 4-digit groups that are on word boundaries (the \b tests); I don't know if that meets your requirements. The pattern uses .*? to ensure that as few characters as possible are matched between each four-digit group. The pattern is matched $n times, and the capture group $1 is set to whatever it was in the last iteration, i.e. the $n'th one.
EDIT: When I just tried it again, it seemed to interpolate $n in a quantifier just fine. I don't know what I did differently that it didn't work last time. So maybe this will work:
if ($s =~ /^(?:.*?\b(\d{4})\b){$n}/) { ...
If not, see amon's comment about qr//.
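A minimal sketch of the same idea with the pattern compiled through qr//, wrapped in a helper. The name nth_four_digit is hypothetical and the test string is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n-th whole four-digit number in $s,
# or undef if there are fewer than $n of them.
sub nth_four_digit {
    my ($s, $n) = @_;
    my $pat = qr/^(?:.*?\b(\d{4})\b){$n}/;
    return $s =~ $pat ? $1 : undef;
}

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
for my $n (1 .. 4) {
    my $found = nth_four_digit($str, $n);
    print "$n: ", defined $found ? $found : 'not found', "\n";
}
# 1: 1234
# 2: 2098
# 3: 3213
# 4: not found
```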
If the regex is only matched once, then match all three in one regex and extract them using matched groups:
^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$
The three 4-digit numbers will be captured in groups 1, 2 and 3.
Ajb's answer with "gx" is the best. If you know you will have three numbers, this straightforward line does the trick:
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my ($num1, $num2, $num3) = $str =~ /\b \d{4} \b/gx;
print "$num1, $num2, $num3\n";

Perl Regular Expression extracting sub-string?

I have a String variable containing something like ABCD.asd.qwe.com:/dir1.
I want to extract the ABCD portion i.e. the portion from beginning till the first appearance of .. The problem is that there can be almost any characters (only alphanumeric) of any length before the .. So I created this regexp.
if($arg =~ /(.*?\.?)/)
{
my $temp_name = $1;
}
However it is giving me blank string. The logic is that :
.*? - any character non-greedily
\.? - till first or none appearance of .
What could be wrong?
You can instead use negative character class like this
^[^.]+
[^.] would match any character except .
[^.]+ would match 1 to many characters(except .)
^ depicts the start of string
OR
^.+?(?=\.|$)
(?=) is a lookahead, which checks for a particular pattern after the current position. So for the text abcdad with the regex a(?=b), only the first a would match
$ matches at the end of the string (or just before a trailing newline) by default, or at the end of each line when the /m (multiline) option is used
\.? doesn't mean "till first or none appearance of .". It means "a . here or not".
If the first character of the string is .:
.*? matches 0 chars at position 0.
\.? matches 1 char at position 0.
$1 contains ..
If the first character of the string isn't .:
.*? matches 0 chars at position 0.
\.? matches 0 chars at position 0.
$1 is empty.
To match ABCD, the following would do:
/^(.*?)\./
However, I hate the non-greedy modifier. It's fragile, in the sense that it stops doing what you want if you use two in the same pattern. I'd use the following instead ("match non-periods"):
/^([^.]*)\./
or even just
/^([^.]*)/
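The position-by-position analysis above can be checked directly. This sketch runs the original pattern and the anchored negated-class pattern against the string from the question. The '.hidden' input is a made-up example for the leading-dot case.

```perl
use strict;
use warnings;

my $arg = 'ABCD.asd.qwe.com:/dir1';

# Original pattern: both parts may match nothing, so the match
# succeeds immediately at position 0 with an empty capture.
my ($original) = $arg =~ /(.*?\.?)/;

# Same pattern on a string that starts with a dot: \.? takes the dot.
my ($leading_dot) = '.hidden' =~ /(.*?\.?)/;

# Anchored negated character class: everything before the first dot.
my ($name) = $arg =~ /^([^.]*)\./;

print "original:    '$original'\n";       # original:    ''
print "leading dot: '$leading_dot'\n";    # leading dot: '.'
print "name:        '$name'\n";           # name:        'ABCD'
```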
use strict;
my $string = "ABCD.asd.qwe.com:/dir1";
$string =~ /([^.]+)/;
my $capture = $1;
print"$capture\n";
OR you can also use Split function like,
my $sub_string = ( split /\./, $string )[0];
print"$sub_string\n";
Note in general: For the explanation of a regex (understanding a complex regex), take a look at the YAPE::Regex::Explain module.
This should work:
if($arg =~ /(.*?)\..+/)
{
my $temp_name = $1;
}
That would match anything before the first . .
You could change the .+ to .* if your input may end after the first ..
You could change the first .*? to .+? if you are sure that there is always at least one character before the first ..

what does this line do in Perl? ($rowcol =~ m/([A-Z]?)([0-9]+)/);

What does this line do in Perl?
my @parsedarray = ($rowcol =~ m/([A-Z]?)([0-9]+)/);
$rowcol is something like A1, D8 etc... and I know that the script somehow splits them up because the next two lines are these:
my $row = $parsedarray[0];
my $col = $parsedarray[1];
I just can't see what this line does ($rowcol =~ m/([A-Z]?)([0-9]+)/); and how it works.
([A-Z]?) means capture at most one uppercase letter. ([0-9]+) means match and capture at least one digit.
Next time, you can install YAPE::Regex::Explain to tell you what's going on. eg
use YAPE::Regex::Explain;
my $regex = "([A-Z]?)([0-9]+)";
my $exp = YAPE::Regex::Explain->new($regex)->explain;
print $exp."\n";
Note that m// in list context returns all the captured sub-strings.
Broken into pieces, this is what's going on:
my @parsedarray =      # declare an array, and assign it the results of:
(
    $rowcol =~         # the results of $rowcol matched against
    m/                 # the pattern:
        ([A-Z]?)       # 0 or 1 upper-case-alpha characters (1st submatch),
                       # followed by
        ([0-9]+)       # 1 or more numeric characters (2nd submatch)
    /x                 # this flag added to allow this verbose formatting
);                     # ...which in list context is all the submatches
So if $rowcol = 'D3':
my @parsedarray = ('D3' =~ m/([A-Z]?)([0-9]+)/); # which reduces to:
my @parsedarray = ('D', '3');
You can read about regular expressions in depth at perldoc perlrequick (a quick summary), perldoc perlretut (the tutorial), and perldoc perlre (all the details), and the various regular expression operations (matching, substitution, translation) at perldoc perlop.
The operator m// is a pattern match, basically a synonym of //. This matches an optional leading letter and then one or more digits in $rowcol. The match returns a list with each element containing one of the matched groups (the parts in parentheses). Therefore $parsedarray[0] contains the letter (or nothing) and $parsedarray[1] contains the digits.
It:
matches against the regular expression (zero or one capital letter, followed by one or more digits)
Captures two groups:
zero or one capital letter
one or more digits
Assigns those captured groups to the @parsedarray array
Example code to test:
use Data::Dumper;
my $rowcol = "A1";
my @parsedarray = ($rowcol =~ m/([A-Z]?)([0-9]+)/);
print Dumper(\@parsedarray);
yields:
$VAR1 = [
'A',
'1'
];
Note that if the string had no leading capitalised letter (e.g. "a1") then it would return an empty string for $parsedarray[0].
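To see how the list-context match behaves on other inputs, here is a short sketch that loops over a few values. The inputs other than A1 and D8 are made up for illustration, and the @lines array exists only to collect the output.

```perl
use strict;
use warnings;

my @lines;
for my $rowcol ('A1', 'D8', 'a1', 'AB12', 'xyz') {
    my @parsedarray = ($rowcol =~ m/([A-Z]?)([0-9]+)/);
    if (@parsedarray) {
        my ($row, $col) = @parsedarray;
        push @lines, "$rowcol -> row '$row', col '$col'";
    }
    else {
        # No digits at all: the match fails and the list is empty.
        push @lines, "$rowcol -> no match";
    }
}

print "$_\n" for @lines;
# A1 -> row 'A', col '1'
# D8 -> row 'D', col '8'
# a1 -> row '', col '1'
# AB12 -> row 'B', col '12'
# xyz -> no match
```

Note that the pattern is not anchored, so for AB12 only the letter directly in front of the digits is captured.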
My Perl is a little rusty, but if I understood your question, the answer is that it matches $rowcol against the regular expression:
an optional capital letter from A through Z followed by one or more digits, and assigns the captured parts to @parsedarray